Description

The present invention relates generally to speech synthesis, and particularly to methods and systems for converting 
textual data into synthetic speech. 

BACKGROUND OF THE INVENTION 

The automatic conversion of text to synthetic speech is commonly known as text to speech (TTS) conversion or
text to speech (TTS) synthesis. A number of different techniques have been developed to make TTS conversion
practical on a commercial basis. An excellent article on the history of TTS development, as well as the state of the art in
1987, is Dennis H. Klatt, Review of text-to-speech conversion for English, Journal of the Acoustical Society of America
vol. 82(3), September 1987.

A number of commercial products use TTS techniques, including the Speech Plus Prose 2000 (made by the
assignee of the applicants), the Digital Equipment DECTalk, and the Infovox SA-101.

Overview of Prior Art TTS 

Referring to Figure 1, most commercial TTS products first convert text into a stream of phonemes (with
representations for emphasis and stress) and then use a "synthesis by rule" technique for converting the phonemes into synthetic
speech. For example, in the Speech Plus Prose 2000 Text-to-Speech Converter the first step of the TTS process is
text normalization (box 20), which expands abbreviations to their full word form. The Text Normalization routine 20
also expands numbers, monetary amounts, punctuation and other non-alphabetic characters into their full word equivalents.

Most words are converted to phonemes by a set of Word to Phoneme Rules 24. However, the pronunciation of
some words does not follow the standard rules. The phoneme strings for these special words are stored in a Dictionary
Look-Up Table 22. In a typical TTS system, 3000 to 5000 such words are stored in the Dictionary 22. Thus, using either
the Dictionary 22 or the Phoneme Rules 24 for each particular word, all text input is converted into phoneme strings.
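The dictionary-first, rules-second lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the phoneme symbols, the two dictionary entries, and the letter-by-letter `toy_rules` function are hypothetical stand-ins, not the contents of the Prose 2000 Dictionary 22 or Word to Phoneme Rules 24.

```python
from typing import Callable, Dict, List

# Hypothetical exception dictionary: words whose pronunciation does not
# follow the general rules. Entries and phoneme symbols are illustrative only.
EXCEPTIONS: Dict[str, List[str]] = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
}


def toy_rules(word: str) -> List[str]:
    """Stand-in for real word-to-phoneme rules: one symbol per letter."""
    return [ch.upper() for ch in word if ch.isalpha()]


def word_to_phonemes(
    word: str,
    exceptions: Dict[str, List[str]] = EXCEPTIONS,
    rules: Callable[[str], List[str]] = toy_rules,
) -> List[str]:
    """Use the exception dictionary when the word is listed, else the rules."""
    key = word.lower()
    if key in exceptions:
        return list(exceptions[key])
    return rules(key)


def text_to_phonemes(words: List[str]) -> List[str]:
    """Convert a list of already-normalized words into one phoneme string."""
    phonemes: List[str] = []
    for word in words:
        phonemes.extend(word_to_phonemes(word))
    return phonemes
```

Keeping the exceptions in a table separate from the rules mirrors the split between boxes 22 and 24 in Figure 1: the table stays small because only irregular words need entries.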

The Word-Level Stress Assignment routine 26 assigns stress to phonemes in the phoneme string. Variations in 
assigned stress result in pitch and duration differences that make some sounds stand out from others. 

It is well known that the pronunciation of phonemes in human (or natural) speech is context dependent. To mimic
natural speech, the synthetic pronunciation of each phoneme is determined by a set of rules which analyze the phonetic
context of the phoneme. The Allophonics routine 28 assigns allophones to at least a portion of the consonant phonemes
in the phoneme string 25.

Allophones are variants of phonemes based on surrounding speech sounds. For instance, the aspirated "p" of the
word pit and the unaspirated "p" of the word spit are both allophones of the phoneme "p".

One way to try to make synthetic speech sound more natural is to "assign" or generate allophones for each
phoneme based on the surrounding sounds, as well as the speech rate, syntactic structure and stress pattern of the
sentence. Some prior art TTS products, such as the Speech Plus Prose 2000, assign allophones to certain consonant
phonemes based on the context of those phonemes. In other words, an allophone is selected for a particular consonant
phoneme based on the context of that phoneme in a particular word or sentence.

The Sentence-Level Prosodics rules 30 in the Speech Plus Prose 2000 determine the duration and fundamental
frequency pattern of the words to be spoken. The resultant intonation contour gives sentences a semblance of the
rhythm and melody of a human speaker. The prosodics rules 30 are sensitive to the phonetic form and the part of
speech of the words in a sentence, as well as the speech rate and the type of the prosody selected by the user of the
system.

The Parameter Generator 40 accepts the phonemes specified by the early portions of the TTS system, and
produces a set of time varying speech parameters using a "constructive synthesis" algorithm. In other words, an algorithm
is used to generate context dependent speech parameters instead of using pieces of prestored speech. The purpose
of the constructive synthesis algorithm is to model the human vocal tract and to generate human sounding speech.
The speech parameters generated by the Parameter Generator 40 control a digital signal processor known as a
Formant Synthesizer 42 because it generates signals which mimic the formants (i.e., resonant frequencies of the vocal
tract) characteristic of human speech. The Formant Synthesizer outputs a speech waveform 44 in the form of an
electrical signal that is used to drive an audio speaker and thereby generates audible synthesized speech.

Diphone Concatenation. Another technique for TTS conversion is known as diphone concatenation. A diphone is
the acoustic unit which spans from the middle of one phoneme to the middle of the next phoneme. TTS conversion
systems using diphone concatenation employ anywhere from 1000 to 8000 distinct diphones. In diphone concatenation
systems, each diphone is stored as a chunk of encoded real speech recorded from a particular person. Synthetic
speech is generated by concatenating an appropriate string of diphones. Because each diphone is a fixed
package of encoded real speech, diphone concatenation has difficulty synthesizing syllables with differing stress and
timing requirements. While some experimental diphone concatenation systems have good voice qualities, the inherent
timing and stress limitations of concatenation systems have limited their commercial appeal. Some of the limitations
of diphone concatenation systems may be overcome by increasing the number of diphones used so as to include
similar diphones with different durations and fundamental frequencies, but the amount of memory storage required
may be prohibitive.

A similar technique, called demisyllable concatenation, employs demisyllables instead of diphones. A demisyllable
is the acoustic unit which spans from the start of a consonant to the middle of the following vowel in a syllable, or from
the middle of a vowel to the end of the following consonant in a syllable.

One reason for the prevalence of TTS systems which use "synthesis by rule" techniques, as opposed to diphone
or demisyllable concatenation systems, is that synthesis by rule provides a greater ability to vary timing, intonation and
allophonic detail - all of which are important to making synthetic speech intelligible, variable and pleasant to listen to.
In addition, it has been demonstrated that the synthesis of phonemes follows certain patterns that can be generalized
and represented by a set of rules.

Generally, diphone concatenation systems and synthesis by rule systems have different strong points and weak- 
nesses. Diphone concatenation systems can sound like a person when the proper diphones are used because the 
speech produced is "real" encoded speech recorded from the person that the system is intended to mimic. Synthesis 
by rule systems are more flexible in terms of stress, timing and intonation, but have a machine-like quality because 
the speech sounds are synthetic. 

An embodiment of the present invention can be thought of as a hybrid of the synthesis by rule and diphone
concatenation techniques.

Instead of using encoded (i.e., stored real speech) diphones, the present invention incorporates into a synthesis 
by rule system vowel allophones that are synthetic, but which resemble the full allophonic repertoire of a particular 
person. 

Vowel Allophones

To a large degree, the prior art TTS systems and techniques generate allophones only for consonant phonemes.
Vowel phonemes are generally given a static representation (i.e., are represented by a fixed set of formant frequency
and bandwidth values), with "allophones" being formed by "smoothing" the vowel's formants with those of the
neighboring phonemes.

More precisely, the fixed representation of each vowel phoneme is a partial set of formant frequency and bandwidth
values which are derived by analyzing and selecting or averaging the formant values of one or more persons when
speaking words which include that vowel phoneme. Vowel allophones (i.e., context dependent variations of vowel
phonemes) are generated in the prior art systems, if they are generated at all, by formant smoothing. Formant smoothing
is a curve fitting process by which the back and forward boundaries of the vowel phoneme (i.e., the boundaries between
the vowel phoneme and the prior and following phonemes) are modified so as to smoothly connect the vowel's formants
with those of its neighbors.

One embodiment of the present invention, on the other hand, stores an encoded form of every possible allophone
in the English (or any other) language. While this would appear to be impractical, at least from a commercial viewpoint,
the embodiment provides a practical method of storing and retrieving every possible vowel allophone. More specifically,
a vowel allophone library is used to store distinct allophones for every possible vowel context. When synthesizing
speech, each vowel phoneme is assigned an allophone by determining the surrounding phonemes and selecting the
corresponding allophone from the vowel allophone library.
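As a rough illustration of this assignment step, the sketch below looks up each vowel in a library keyed by its immediate neighbors. The phoneme symbols, the two-entry library, the use of a silence symbol at string edges, and the function name are all assumptions made for the example; the actual library layout is described later in the specification.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical symbols; the real phoneme inventory is not reproduced here.
VOWELS = {"IH", "AE"}
SILENCE = "SIL"

# Toy library keyed by (preceding phoneme, vowel, following phoneme).
# The values stand in for stored allophone parameter sets.
LIBRARY: Dict[Tuple[str, str, str], str] = {
    ("P", "IH", "T"): "IH as in 'pit'",
    ("S", "IH", "T"): "IH as in 'sit'",
}


def assign_vowel_allophones(
    phonemes: List[str],
    library: Dict[Tuple[str, str, str], str] = LIBRARY,
) -> List[Tuple[str, Optional[str]]]:
    """Pair each phoneme with its vowel allophone, or None for non-vowels.

    String edges are treated as silence (an assumption of this sketch).
    A vowel whose context is missing from the toy library also maps to None.
    """
    result: List[Tuple[str, Optional[str]]] = []
    for i, phoneme in enumerate(phonemes):
        if phoneme not in VOWELS:
            result.append((phoneme, None))
            continue
        left = phonemes[i - 1] if i > 0 else SILENCE
        right = phonemes[i + 1] if i + 1 < len(phonemes) else SILENCE
        result.append((phoneme, library.get((left, phoneme, right))))
    return result
```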

The inventors have found that using a large library of encoded vowel allophones, rather than a small set of static
vowel phonemes, greatly improves the intelligibility and naturalness of synthetic speech. It has been found that the
use of encoded vowel allophones reduces the machine-like quality of the synthetic speech generated by TTS
conversion.

In the context of Figure 1, the inventors have improved the parameter generator 40 of the prior art Speech Plus
Prose 2000 system by adding a vowel allophone capability. Thus the generation of vowel allophones is handled
separately from the generation of consonant allophonics by Allophonics module 28.

More generally, though, the invention does not depend on the exact TTS technique being used in that it provides 
a system and method for replacing the static vowel phonemes in prior art TTS systems with context dependent vowel 
allophones. 

It is therefore a primary object of one embodiment of the present invention to improve the quality and intelligibility
of the synthetic speech produced by TTS conversion systems.

An object of another embodiment of the present invention is to improve the quality and intelligibility of synthetic
speech produced by TTS conversion systems by generating context dependent vowel allophones.

An object of another embodiment of the present invention is to provide a large library of vowel allophones and a
technique for assigning allophones in the library to the vowel phonemes in a phrase that is to be synthetically
enunciated, so as to generate natural sounding vowel phonemes.

An object of another embodiment of the present invention is to provide a TTS conversion system that sounds like
a particular person. A related object is to provide a methodology for adapting TTS conversion systems to make them
sound like particular individuals.

An object of yet another embodiment of the present invention is to provide a practical method and system for
storing and retrieving a large library of vowel allophones, representing all or practically all of the vowel allophones in
a particular language, so as to enable use of the present invention in commercial applications.

SUMMARY OF THE INVENTION 

The present invention provides a text-to-speech synthesis system, comprising text conversion means for convert- 
ing a specified text into a corresponding string of consonant and vowel phonemes, each said phoneme being selected 
from a predefined set of phonemes including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes; 
parameter generating means for generating speech parameters corresponding to said string of phonemes; and speech 
synthesizing means for generating a speech waveform corresponding to the speech parameters generated by said 
parameter generating means; characterized by vowel allophone storage means storing a multiplicity of predefined 
vowel allophones, each vowel allophone being represented by a set of speech parameters; said vowel allophones 
including allophones for a multiplicity of vowel phonemes; vowel phoneme to allophone conversion means, coupled 
to said text conversion means and said vowel allophone storage means, for computing a phoneme context value for 
each of at least a subset of said vowel phonemes in said string of phonemes, said phoneme context value comprising 
a function of the phonemes in said string of phonemes which precede and follow said vowel phoneme, and for then 
assigning to said vowel phoneme a selected one of said predefined vowel allophones corresponding to said computed
phoneme context value; said parameter generating means including means for generating speech parameters for said 
assigned vowel allophones. 

The present invention also provides a method of converting text strings into synthetic speech, the steps comprising: 
defining a set of phonemes, including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes; 
converting a specified text string into a corresponding string of phonemes, said string of phonemes including consonant 
and vowel phonemes, each said phoneme being selected from said defined set of phonemes; and converting said 
string of phonemes into speech parameters and then generating an audio waveform corresponding to said speech 
parameters; characterized by: storing a multiplicity of predefined vowel allophones, each vowel allophone being rep- 
resented by a set of speech parameters; for each of at least a subset of said vowel phonemes in said string of phonemes, 
computing a phoneme context value for said vowel phoneme as a function of the phonemes in said string of phonemes 
which precede and follow said vowel phoneme, and then assigning to said vowel phoneme a selected one of said 
predefined vowel allophones corresponding to said computed phoneme context value; and said converting step in- 
cluding converting said assigned vowel allophones into speech parameters which are then used to generate an audio 
waveform corresponding to said speech parameters. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Embodiments of the present invention will now be described with reference to the drawings, in which:- 

Figure 1 is a flow chart of the text to speech conversion process. 

Figure 2 is a block diagram of a system for performing text to speech conversion. 

Figure 3 depicts a spectrogram showing one vowel allophone. 

Figure 4 depicts one formant of a vowel allophone. 

Figure 5 is a block diagram of one formant code book and an allophone with a pointer to an item in the code book.

Figure 6 is a block diagram of the vector quantization process for generating a code book of vowel allophone
formant parameters.

Figure 7A, 7B and 10 are block diagrams of the process for generating the formant parameters for a specified 
vowel allophone. 

Figure 8 is a block diagram of an allophone context map data structure and a related duplicate context map. 

Figure 9 is a block diagram of a vowel context data table. 

Figure 10 is a block diagram of an alternate LLRR vowel context table. 

Figure 11 is a block diagram of the process for generating speech parameters for a specified vowel allophone in 
an alternate embodiment of the present invention. 

DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring to Figure 2, the preferred embodiment of the present invention is a reprogrammed version of the Speech
Plus Prose 2000 product, which is a TTS conversion system 50. The basic components of this system include a CPU
controller 52, which executes the software stored in a Program ROM 54. Random Access Memory (RAM) 56 provides
workspace for the tasks run by the CPU 52. Information, such as text strings, is sent to the TTS conversion system 50
via a Bus Interface and I/O Port 58. These basic components of the system 50 communicate with one another via a
system bus 60, as in any microcomputer based system.

Note that boxes 20 through 40 in Figure 1 comprise a computer (represented by boxes 52, 54 and 56 in Figure 2)
programmed with appropriate TTS software. It is also noted that the TTS software may be downloaded from a disk or
host computer, rather than being stored in a Program ROM 54.

Also coupled to the system bus 60 is a Formant Synthesizer 62, which is a digital signal processor that translates
formant and other speech parameters into speech waveform signals that mimic human speech. The digital output of
the Formant Synthesizer 62 is converted into an analog signal by a digital to analog converter 64, which is then filtered
by a low pass filter 66 and amplified by an audio amplifier 68. The resulting synthetic speech waveform is suitable for
driving a standard audio speaker.

One embodiment of the present invention synthesizes speech from text using a variation of the process shown in
Figure 1. In the preferred embodiment vowel allophones are assigned to vowel phonemes by an improved version of
the parameter generator 40. In terms of the sequence of process steps, the vowel allophone assignment process takes
place between blocks 30 and 40 in Figure 1.

As explained above, one embodiment of the present invention generates improved synthetic speech by replacing
the fixed formant parameters for vowel phonemes used in the prior art with selected formant parameters for vowel
allophones. The vowel allophones are selected on the basis of the "context" of the corresponding phoneme - i.e., the
phonemes preceding and following the vowel phoneme that is being processed.

To understand the magnitude of this task, consider the following. Assume for the purposes of this example that
the context of a vowel phoneme is defined solely by the phonemes immediately preceding and following the vowel
phoneme. The preferred embodiment of the invention uses 57 phonemes (including 23 vowel phonemes, 33 consonant
phonemes, and silence). For each vowel (i.e., vowel phoneme) there are 3136 (i.e., 56 x 56) possible
phoneme-vowel-phoneme (PVP) contexts. In other words, there are 3136 possible allophones for each of the 23 vowel phonemes, or
a total of 72,128 vowel allophones.

In the preferred embodiment, and many commercial products, the enunciation of a vowel phoneme is represented
by four formants, requiring approximately 40 bytes to store each vowel allophone. The data structure for storing a
single phoneme enunciation (i.e., allophone) is described in more detail below. Without using some form of data
compression, it would require nearly three megabytes of memory to store the 72,128 possible vowel allophones. In most
commercial applications, it is currently not practical to use so much memory just to store a library of vowel allophones.
It should be noted that in many commercial applications, a TTS system is an "add-on board" which must occupy a
relatively small amount of space and must cost less than a typical desktop computer.
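The counts above can be checked with a few lines of arithmetic. The sketch takes the figure of 56 possible neighbors on each side directly from the text; the `context_index` helper and its row-major numbering are hypothetical, shown only to make the idea of one slot per context concrete.

```python
NEIGHBOR_CHOICES = 56        # neighbors per side, as stated in the text
VOWEL_PHONEMES = 23          # vowel phonemes in the preferred embodiment
APPROX_BYTES_PER_ALLOPHONE = 40

contexts_per_vowel = NEIGHBOR_CHOICES * NEIGHBOR_CHOICES
total_allophones = contexts_per_vowel * VOWEL_PHONEMES
uncompressed_bytes = total_allophones * APPROX_BYTES_PER_ALLOPHONE


def context_index(left: int, right: int) -> int:
    """Map a (left, right) neighbor pair, each in 0..55, to a slot 0..3135.

    Row-major numbering is an assumption of this sketch.
    """
    if not (0 <= left < NEIGHBOR_CHOICES and 0 <= right < NEIGHBOR_CHOICES):
        raise ValueError("neighbor index out of range")
    return left * NEIGHBOR_CHOICES + right


print(contexts_per_vowel)    # 3136
print(total_allophones)      # 72128
print(uncompressed_bytes)    # 2885120, i.e. nearly three megabytes
```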

One embodiment of the present invention provides a practical and relatively low cost method of storing and ac- 
cessing the data for all 72,128 vowel allophones, using allophone data tables which occupy about one tenth of the 
space which would be required in a system that did not use data compression. Before explaining how this is done, it 
is first necessary to review the data used to represent vowel allophones. 

Speech Formant Parameters 

Figure 3 shows a somewhat simplified model of the speech spectrogram 80 for one vowel allophone. The speech
spectrogram 80 shows four formants f1, f2, f3 and f4. As shown, each formant has a distinct frequency "trajectory",
and a distinct bandwidth which varies over the duration of the allophone. The frequency trajectory and bandwidth of
each formant directly correlate with the way that formant sounds.

To store and retrieve any sound, one can simply record the soundwave and play it back. However, that is not
practical when building a library of over 72,000 allophones because of the huge volume of memory which would be
required to store the digital samples.

Rather, speech waveforms can be reconstructed from information stored in a much more compressed form
because of knowledge about their structure and production. In particular, one standard method of reconstructing a speech
waveform is to record the frequency trajectory of each formant, plus the bandwidth trajectory of at least the lower two
or three formants. Then the waveform is synthesized by using the frequency and bandwidth trajectories to control a
formant synthesizer. This method works because the formant frequencies are the resonant frequencies of the vocal
tract and they characterize the shape of the vocal tract as it changes to produce the speech waveform.

Referring to Figures 3 and 4, in the present invention each individual allophone formant is represented by six
frequency measurements (bbx, v1x, v2x, v3x, v4x and fbx), four time measurements (t1x, t2x, t3x and t4x), and three
bandwidth measurements (b3x, b5x and b7x), where "x" identifies the formant. These measurements trace the
frequency trajectory of the formant, as well as changes in its bandwidth.

Table 1 lists the measurement parameters for a single allophone formant and describes the measured quantity 
represented by each parameter. 

Table 2 lists the full set of parameters for an allophone. As shown, this includes the parameters for four formants. 
Note that no bandwidth parameters are included for the fourth formant f4. The bandwidth of the fourth formant is treated 
as a constant value as it varies little compared with the bandwidth of the other three formants. 
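One way to picture the parameter set is as a small record per formant, with four such records per allophone. The sketch below is an assumed in-memory layout for illustration only; the class names, field names, and the sample numbers are hypothetical, and only the parameter counts come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FormantTrack:
    """Measurements for one formant of one allophone."""
    bb: float                            # frequency at back boundary
    points: List[Tuple[float, float]]    # (t1, v1) .. (t4, v4)
    fb: float                            # frequency at forward boundary
    # (b3, b5, b7); None for the fourth formant, whose bandwidth is constant.
    bandwidths: Optional[Tuple[float, float, float]] = None

    def parameter_count(self) -> int:
        count = 2 + 2 * len(self.points)   # bb, fb, plus a time and a frequency per point
        if self.bandwidths is not None:
            count += len(self.bandwidths)
        return count


@dataclass
class Allophone:
    formants: List[FormantTrack]         # f1, f2, f3, f4 in order

    def parameter_count(self) -> int:
        return sum(f.parameter_count() for f in self.formants)


def example_allophone() -> Allophone:
    """Build an allophone with made-up values, purely to count parameters."""
    pts = [(20.0, 500.0), (40.0, 520.0), (60.0, 540.0), (80.0, 530.0)]
    with_bw = [FormantTrack(480.0, list(pts), 510.0, (60.0, 70.0, 80.0)) for _ in range(3)]
    fourth = FormantTrack(3500.0, list(pts), 3500.0, None)
    return Allophone(with_bw + [fourth])
```

With three formants carrying 13 parameters each and the fourth carrying 10, `example_allophone().parameter_count()` comes to 49 values per allophone.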

TABLE 1

DATA FOR ONE ALLOPHONE FORMANT (x)

Parameter   Description
bbx         frequency at back boundary of allophone
v1x         frequency at time t1
t1x         time of measurement v1
v2x         frequency at time t2
t2x         time of measurement v2
v3x         frequency at time t3
t3x         time of measurement v3
v4x         frequency at time t4
t4x         time of measurement v4
fbx         frequency at forward boundary of allophone
b3x         bandwidth 30 milliseconds after back boundary
b5x         bandwidth 50 percent of the way through the duration of the allophone
b7x         bandwidth 70 percent of the way through the duration of the allophone


TABLE 2

DATA FOR ONE ALLOPHONE - FOUR FORMANTS

FORMANT   Parameters
f1        bb1, v11, t11, v21, t21, v31, t31, v41, t41, fb1, b31, b51, b71
f2        bb2, v12, t12, v22, t22, v32, t32, v42, t42, fb2, b32, b52, b72
f3        bb3, v13, t13, v23, t23, v33, t33, v43, t43, fb3, b33, b53, b73
f4        bb4, v14, t14, v24, t24, v34, t34, v44, t44, fb4



Data Compression Using Vector Quantization

To store the parameters listed in Table 2 for a single allophone requires 38 bytes: 8 bytes for the eight forward and
back boundary values, 16 bytes for the sixteen intermediate frequency values, 8 bytes for the sixteen intermediate
time values (4 bits each), and 6 bytes for the three sets of bandwidth values. Table 3 shows how each measurement
value is scaled so as to enable this efficient representation of the data for one allophone. Using more standard, less
efficient, representations of the formants would require forty-one or more bytes of data for each allophone.
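The 38-byte figure follows from the bit widths given in the text and in Table 3, as the short calculation below shows. The constant and function names are introduced here for illustration only.

```python
# Counts of stored values per allophone, from the text.
BOUNDARY_VALUES = 8       # bb and fb for each of four formants
FREQUENCY_VALUES = 16     # v1..v4 for each of four formants
TIME_VALUES = 16          # t1..t4 for each of four formants
BANDWIDTH_SETS = 3        # (b3, b5, b7) for formants 1 to 3 only

# Bit widths, from the text and Table 3.
BITS_PER_BOUNDARY = 8
BITS_PER_FREQUENCY = 8
BITS_PER_TIME = 4
BITS_PER_BANDWIDTH_SET = 6 + 5 + 5   # b3, b5, b7


def bytes_per_allophone() -> int:
    """Total storage for one allophone before code books are introduced."""
    total_bits = (
        BOUNDARY_VALUES * BITS_PER_BOUNDARY
        + FREQUENCY_VALUES * BITS_PER_FREQUENCY
        + TIME_VALUES * BITS_PER_TIME
        + BANDWIDTH_SETS * BITS_PER_BANDWIDTH_SET
    )
    assert total_bits % 8 == 0, "layout should pack into whole bytes"
    return total_bits // 8


print(bytes_per_allophone())  # 38
```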

TABLE 3

FORMANT DATA SCALING

Parameter(s)    # Bits Used*   Scaling

ALLOPHONE DATA TABLES:
bb1, fb1        8              value/4
bb2, fb2        8              (value-500)/8
bb3, fb3        8              value/16
bb4, fb4        8              value/16
b3              6              value/8
b5              5              value/12
b7              5              value/12
FX1             10             code book 1 index value
FX2             9              code book 2 index value
FX3             7              code book 3 index value
FX4             6              code book 4 index value

CODE BOOK VALUES:
v11 thru v41    8              value/4
v12 thru v42    8              (value-500)/8
v13 thru v43    8              value/16
v14 thru v44    8              value/16
t11 thru t44    4              percentage of duration of measured allophone, divided by 2

* number of bits used for each parameter


Note that the amount of data storage needed to store the formant parameters for 72,128 vowel allophones, at 38 bytes 
per allophone, is 2,740,864 bytes. 
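The scaling rules of Table 3 amount to a simple quantization: subtract an offset, divide by a step size, and keep the result in a fixed number of bits. The sketch below illustrates that reading for a few of the table's rows. Rounding to the nearest code and clamping to the available range are assumptions of this sketch; the specification does not say how out-of-range or fractional values are handled.

```python
from typing import Dict, Tuple

# parameter -> (divisor, offset, bits), copied from rows of Table 3.
SCALING: Dict[str, Tuple[int, int, int]] = {
    "bb1": (4, 0, 8),
    "bb2": (8, 500, 8),
    "bb4": (16, 0, 8),
    "b3": (8, 0, 6),
    "b5": (12, 0, 5),
}


def encode(name: str, value: float) -> int:
    """Scale a measured value into its stored integer code."""
    divisor, offset, bits = SCALING[name]
    code = round((value - offset) / divisor)
    return max(0, min((1 << bits) - 1, code))   # clamp: an assumption


def decode(name: str, code: int) -> int:
    """Recover the approximate measured value from a stored code."""
    divisor, offset, _bits = SCALING[name]
    return code * divisor + offset


print(encode("bb2", 1500))               # 125
print(decode("bb2", encode("bb2", 1500)))  # 1500
```

Under this reading an 8-bit code for the second formant spans 500 to 2540 in steps of 8, which is why the offset of 500 lets one byte cover a range that would otherwise need more bits.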

Formant Code Books 



One embodiment of the present invention reduces the amount of data storage needed in two ways: (1) by using
vector quantization to more efficiently encode the "intermediate" portions of the formants (i.e., v1 through v4 and t1
through t4), and (2) denoting "duplicate" allophones with virtually identical formant parameter sets. This section
describes the vector quantization used in the preferred embodiment.

Figure 5 depicts a data structure herein called the code book 90 for one formant. Since each allophone is modelled 
as having four formants, the TTS system uses four code books 90a - 90d, as will be discussed in more detail below. 

For the purposes of this example, assume that the code book 90 in Figure 5 has 1000 rows of data. Each entry
or row 92 contains the intermediate data values for one allophone formant: v1 through v4 and t1 through t4, as defined
in Table 1.

Using the code book 90, the data 94 representing one allophone formant is now reduced to forward and back 
boundary values bb and fb, three bandwidth values b3, b5 and b7, and a pointer 96 to one entry (i.e., row) in the code 
book. Thus the amount of data storage required to store one allophone formant is now five bytes: one for the pointer 
96, two for the boundary values and two for the bandwidth values. For the fourth formant, the amount of storage required 
is three bytes because no bandwidth data is stored. Without the code book 90, the amount of storage required was 
ten bytes per formant, and eight for the fourth formant. 

Thus, if the code book 90 is considered to be a "fixed cost", the amount of storage for each allophone formant is
reduced by half through the use of the code book. To show that this is a valid measurement of data compression
consider the following. If code books are not used, the amount of data storage required to store the intermediate
frequency and time values for 72,128 allophones is 24 bytes per allophone, or a total of 1,731,072 bytes. Four code
books, with an average of 1000 entries each, occupy 24,000 bytes. Storing 72,128 allophones, using four one-byte
code book pointers per allophone, requires 288,512 bytes to store the pointers, plus 24,000 bytes for the code books,
for a total of 312,512 bytes - as compared to 1,731,072 bytes without compression. This represents a compression ratio
of about 5.5:1.
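The compression figures can be reproduced directly from the numbers in the text. The six bytes per code book row used below follow from four 8-bit frequency values plus four 4-bit time values; the variable names are introduced here for illustration.

```python
ALLOPHONES = 72_128
FORMANTS = 4

# Intermediate data per formant: four 8-bit frequencies and four 4-bit times.
ROW_BYTES = (4 * 8 + 4 * 4) // 8                 # 6 bytes
INTERMEDIATE_BYTES = FORMANTS * ROW_BYTES        # 24 bytes per allophone

uncompressed = ALLOPHONES * INTERMEDIATE_BYTES

CODE_BOOK_ROWS = 1000                            # average per code book, per the text
code_book_bytes = FORMANTS * CODE_BOOK_ROWS * ROW_BYTES
pointer_bytes = ALLOPHONES * FORMANTS * 1        # one-byte pointers, per the text

compressed = pointer_bytes + code_book_bytes
ratio = uncompressed / compressed

print(uncompressed)      # 1731072
print(compressed)        # 312512
print(round(ratio, 1))   # 5.5
```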

The next issue is deciding which data values to store in the code book 90 for each formant. In other words, we
must choose the 1000 items 92 in the code book 90 wisely so that there will be an appropriate entry for every allophone
in the English language.

Referring to Figure 6, the four code books 90a - 90d for the four formants f1 - f4 are generated as follows. First,
the speech of a single, selected person is recorded 100 while speaking each and every vowel allophone in the English
(or another selected) language. Next, the recorded speech is digitized and processed to produce a spectrogram for
each vowel allophone. Then, a trained technician selects representative formant frequency values from the formant
trajectories of each vowel allophone. The result of this process is formant frequency and time data 104 for each of four
formants for each of the vowel allophones in the English language. Of course, the process being described here can
be performed with data from just a subset of the vowel allophones.

It is noted that the TTS system 50 can be made to mimic any selected person, selected dialect, or even a selected
cartoon character, simply by recording a person with the desired speech characteristics and then processing the
resulting data.

There is a well-known technique, called vector quantization, for "mapping" a sequence of continuous or discrete
vectors into a smaller representative set of vectors. For a description of how vector quantization works, see Robert M.
Gray, "Vector Quantization", IEEE ASSP Magazine, pp. 4-29, April 1984.

Suffice it to say that given a set of 288,512 (i.e., 4 * 72,128) vectors (box 104 in Figure 6) of the form:

(v1,t1) (v2,t2) (v3,t3) (v4,t4)

vector quantization can be used to generate the set of X vectors which produce the minimum "distortion". Given any
value of X, such as 4000, the vector quantization process 106 will find the "best" set of vectors. This best set of vectors
is called a "code book", because it allows each vector in the original set of vectors 104 to be represented by an "encoded"
value - i.e., a pointer to the most similar vector in the code book.
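The encoding half of vector quantization, replacing a vector by a pointer to its most similar code book entry, can be sketched as follows. Plain squared Euclidean distance is used here purely as a placeholder similarity measure; it is an assumption of this sketch, not the weighted distance of the preferred embodiment. The flattened eight-number vector layout and the function names are likewise illustrative.

```python
from typing import List, Sequence

# One vector holds (v1, t1, v2, t2, v3, t3, v4, t4) flattened to eight numbers.
Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Placeholder dissimilarity between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(vector: Vector, code_book: List[Vector]) -> int:
    """Return the index (pointer) of the most similar code book entry."""
    return min(range(len(code_book)),
               key=lambda i: squared_distance(vector, code_book[i]))


def decode(pointer: int, code_book: List[Vector]) -> Vector:
    """Look up the representative vector that a pointer stands for."""
    return code_book[pointer]


def max_distortion(vectors: List[Vector], code_book: List[Vector]) -> float:
    """Worst-case distance from any vector to its encoded representative."""
    return max(squared_distance(v, decode(encode(v, code_book), code_book))
               for v in vectors)
```

For example, with a two-entry code book, a vector close to the first entry encodes to pointer 0 and is later reconstructed as that entry rather than as its original values.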

Generally, the best set of vectors is one which minimizes a defined value, called the distortion. In the preferred
embodiment, the vector quantizer 106 implements a "minimax" method which selects a specified number of code book
vectors from the set of all vowel allophone vectors such that the maximum weighted distance from the vectors in the
set of vowel allophone vectors to the nearest code book vectors is minimized. The weighted distance between two
vectors is computed as the area between the corresponding formant trajectories multiplied by 1/F, where F is the
average of the forward and backward boundary values for the two trajectories. The distance is weighted by 1/F to give
greater importance to lower frequencies, because lower frequencies are more important than higher ones in human
perception of speech. It has been discovered that the minimax method results in higher quality speech than does an
alternative method that minimizes the average of the distances from the vowel allophone vectors to their nearest code
book vectors. See Eric Dorsey and Jared Bernstein, "Inter-Speaker Comparison of LPC Acoustic Space Using a
Minimax Distortion Measure," Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing (1981) for a discussion of
minimax distortion vector quantization as applied to LPC encoded speech.
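The weighted distance and the minimax objective can be illustrated with a small sketch. Two caveats apply. First, the area between trajectories is approximated numerically on a common time grid, with times expressed as fractions of the allophone duration; that representation is an assumption. Second, the selection routine is a greedy farthest-point heuristic, a commonly used approximation for minimax selection problems, substituted here for illustration; it is not claimed to be the quantizer used in the preferred embodiment.

```python
from typing import List, Sequence, Tuple

# A trajectory runs from the back boundary (time 0.0) to the forward
# boundary (time 1.0) as a list of (time_fraction, frequency) points.
Trajectory = Sequence[Tuple[float, float]]


def freq_at(traj: Trajectory, t: float) -> float:
    """Linearly interpolate the trajectory's frequency at time t."""
    for (t0, f0), (t1, f1) in zip(traj, traj[1:]):
        if t0 <= t <= t1:
            if t1 == t0:
                return f0
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
    return traj[-1][1]


def weighted_distance(a: Trajectory, b: Trajectory, steps: int = 50) -> float:
    """Area between two trajectories times 1/F (trapezoidal approximation)."""
    area = 0.0
    prev = abs(freq_at(a, 0.0) - freq_at(b, 0.0))
    for k in range(1, steps + 1):
        cur = abs(freq_at(a, k / steps) - freq_at(b, k / steps))
        area += 0.5 * (prev + cur) / steps
        prev = cur
    # F: average of the back and forward boundary values of both trajectories.
    mean_boundary = (a[0][1] + a[-1][1] + b[0][1] + b[-1][1]) / 4.0
    return area / mean_boundary


def greedy_minimax(trajs: List[Trajectory], size: int) -> Tuple[List[int], float]:
    """Pick code book entries by repeatedly adding the worst-served trajectory.

    Returns the chosen indices and the resulting maximum weighted distance.
    """
    chosen = [0]
    nearest = [weighted_distance(t, trajs[0]) for t in trajs]
    while len(chosen) < min(size, len(trajs)):
        far = max(range(len(trajs)), key=lambda i: nearest[i])
        if nearest[far] == 0.0:
            break                      # every trajectory is already represented exactly
        chosen.append(far)
        for i, t in enumerate(trajs):
            nearest[i] = min(nearest[i], weighted_distance(t, trajs[far]))
    return chosen, max(nearest)
```

Because of the 1/F factor, a 10 Hz offset between two flat trajectories near 500 Hz counts roughly three times as much as the same offset near 1500 Hz, which is the sense in which lower frequencies are given greater importance.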

The vector quantization is performed once on the entire set of vowel allophone vectors representing data for all
four formants to generate four formant code books 90a - 90d with a total specified size, such as 4000 rows, for the
four code books. In other words, to form code book 90a, the selected vectors that represent formant f1 are stored in
that code book. Similarly, selected vectors for formants f2, f3 and f4 are stored in code books 90b, 90c and 90d,
respectively. The sum n1 + n2 + n3 + n4, where nx is the number of vectors in the code book for formant fx, is equal
to the total code book size specified when the vector quantization process is performed.

In the preferred embodiment, the number of items in each of the code books 90a - 90d is different because the
different formants have differing amounts of variability. In general, n1 > n2 > n3 > n4, because use of the 1/F weighting
factor gives lesser importance to differences between vectors representing higher formants with the result that fewer
vectors are selected for the higher formants. This is desirable because each higher formant is less critical to perceived
vowel quality than the lower formants. In one version of the preferred embodiment the following values were used: n1
= 741, n2 = 451, n3 = 127 and n4 = 81. However, these values change when the allophone data is changed (e.g., when
new allophone data is added). In the preferred embodiment n1 + n2 + n3 + n4 is set to a fixed size, such as 1400 or
4000 (depending on the number of vectors being quantized), and the quantizer sets the individual sizes to minimize
the overall weighted distortion.

Once all of the code books have been generated, vector quantization is no longer used. Thus the completed TTS 
system need not incorporate a vector quantization capability. In the completed TTS system, each allophone is "encoded"
using the four formant code books 90a - 90d with the parameters shown in Table 4.

TABLE 4

PARAMETERS FOR ONE ALLOPHONE

Parameter(s)    Description
FX1 - FX4       indices to entries in formant code books 1, 2, 3 and 4
bb1 - bb4       frequency at back boundary of allophone for formants 1 - 4
fb1 - fb4       frequency at forward boundary of allophone for formants 1 - 4
b31 - b33       bandwidth 30 milliseconds after back boundary for formants 1 - 3
b51 - b53       bandwidth 50 percent of the way through the duration of the allophone, for formants 1 - 3
b71 - b73       bandwidth 70 percent of the way through the duration of the allophone, for formants 1 - 3
LLRRx           index into LLRR Context Table
LLRRd           index into LLRR Allophone Data Table for corresponding vowel phoneme



It should be noted that in the preferred embodiment, the formant data in the code books 90a - 90d is derived from 
the speech of a single person, though the data for any particular vowel allophone may represent the most representative
of several enunciations of the vowel allophone. This is different from most TTS synthesis systems and methods in
which the formant and bandwidth data stored to represent phonemes is data which represents the "average" speech
of a number of different persons. The inventors have found that the averaging of speech data from a number of persons
tends to average out the tonal qualities which are associated with natural speech, and thus results in artificial sounding
synthetic speech.

Generating Vowel Allophones 

When converting text to speech, vowel phonemes are converted into vowel allophones using the process shown
in Figures 7 through 10. It is to be noted that the process of converting vowel phonemes is performed between boxes
30 and 40 in the flow diagram of Figure 1. Thus, at the beginning of this process, the phonemes preceding and following
the vowel phoneme to be converted (the currently "selected" vowel phoneme) are known.

For the purposes of this discussion, it should be understood that the term "vowel allophone" refers to the particular
pronunciation of a vowel phoneme as determined by its neighboring phonemes. As explained below, there is conceptually
a distinct allophone for every PVP context of the vowel phoneme V. However, some allophones are perceptually
indistinguishable from others. For this reason, some vowel allophones are labelled "duplicate" allophones. To save on
memory storage, the formant data representing such duplicate allophones is not repeated.

Many vowels are diphthongs, gliding speech sounds that start with the acoustic characteristics of one vowel and
move toward having those of another. The second part of a diphthong is called an "offglide". There are just a few
common offglides, so vowels fall into a few groups that have a common offglide, and therefore a common effect on a
following phoneme. This has enabled the inventors to group preceding and following vowels into a few categories and
to simplify the system to store and process 1156 (i.e., 34 x 34) CVC (i.e., consonant-vowel-consonant) contexts plus
several CVV (i.e., consonant-vowel-vowel), VVC (i.e., vowel-vowel-consonant) and VVV (vowel-vowel-vowel) contexts
for each vowel phoneme instead of all 3136 (i.e., 56 x 56) PVP (phoneme-vowel-phoneme) contexts for each vowel.

Referring to Figure 7A, the first step of the vowel phoneme conversion process is to determine the context of the
vowel phoneme. The identity of the most appropriate vowel allophone to be used is initially determined by the identity
of the phonemes preceding and following the selected vowel phoneme.

Figure 7A shows a context index calculator 110. The input data to the context index calculator 110 are the phonemes
P1 and P2 preceding and following the selected vowel phoneme V. Initially we will assume that the neighboring phonemes
are consonant phonemes. Of course, sometimes one or both of the neighboring phonemes are vowels, but we
will deal with those cases separately.

The Phoneme Index Table 112 converts any phoneme into an index value between 0 and 33, i.e., one of 34 distinct
values. In the preferred embodiment, there are 33 distinct consonant phonemes plus one for silence. Thus Phoneme
Index Table 112 generates a unique value for each consonant phoneme, including the silence phoneme.

The Phoneme Index Table 112 is used to generate two index values I1 and I2, corresponding to the identities of
the two neighboring phonemes P1 and P2, respectively. The context index calculator 110 then generates a CVC index
value:

CVC Index = I2 + 34*I1

which uniquely identifies the "context" of a vowel phoneme - i.e., the preceding and following consonant phonemes.
In most cases, the CVC Index value can be used to correctly identify the vowel allophone associated with the vowel V.

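
The CVC index computation can be sketched in a few lines of Python. The function names and the range check are illustrative assumptions; only the formula (the index of the following phoneme plus 34 times the index of the preceding phoneme) comes from the description above.

```python
NUM_INDEX_VALUES = 34  # 33 consonant phonemes plus silence


def cvc_index(i1, i2):
    """Combine the index of the preceding phoneme (i1) and the following
    phoneme (i2) into a single CVC index in the range 0..1155."""
    for value in (i1, i2):
        if not 0 <= value < NUM_INDEX_VALUES:
            raise ValueError("phoneme index must be between 0 and 33")
    return i2 + NUM_INDEX_VALUES * i1


def split_cvc_index(cvc):
    """Recover (i1, i2) from a CVC index; shows that the mapping is unique."""
    return divmod(cvc, NUM_INDEX_VALUES)
```

Because each index is smaller than 34, every (I1, I2) pair maps to a different CVC index, and the largest possible value is 33 + 34*33 = 1155.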
When one of the neighboring phonemes is a vowel, the inventors have found that, for the purposes of selecting 
the most appropriate allophone, the following substitution process can be used. 



TABLE 5

ALLOPHONE SUBSTITUTION TABLE FOR C-V1-V2 and V1-V2-C CONTEXTS

OUTER VOWEL                                   REPLACE OUTER VOWEL WITH CONSONANT INDEX FOR:
/ej/, /ij/, /ai/, or [illegible]              /j/
/ou/, /juw/, /uw/, /d/, or /au/               /w/
/3/, /ir/, /er/, /ur/, /br/, or /ar/          /r/
/d/, /a/, /a/, /ae/, /£/, /I/, /t/, or /U/    /?/



The PVP context is relabelled C-V1-V2, or V1-V2-C, as appropriate. To synthesize the inner vowel (V1 in the first
case, V2 in the second), use the substitution values shown in Table 5 (in which phonemes are denoted using standard
IPA symbols) so that a consonant is substituted for the outer vowel. Then the CVC index is computed, as explained
above.

To implement the vowel substitutions shown in Table 5, the Phoneme Index Table 112 includes entries for the 23
vowel phonemes. The entries in the Phoneme Index Table 112 for vowel phonemes are set equal to the values for the
substitute consonant phonemes specified in Table 5. Thus, the context of any and all vowel phonemes is computed
simply by looking up the index values for the neighboring phonemes (regardless of whether they are consonants or
vowels) and then using the CVC index formula shown above.
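
The way the Phoneme Index Table folds the vowel substitution into an ordinary lookup can be sketched as follows. The phoneme labels, the index values and the small subset of phonemes are hypothetical placeholders chosen for illustration; the actual table contents are not given in this description.

```python
# Hypothetical consonant indices (the real table has 34 values, 0..33).
PHONEME_INDEX = {"sil": 0, "j": 1, "w": 2, "r": 3, "t": 4, "k": 5}

# Hypothetical subset of vowels and the consonant that stands in for each
# when the vowel is the outer phoneme of a context.
VOWEL_SUBSTITUTE = {
    "ej": "j", "ij": "j", "ai": "j",
    "ou": "w", "uw": "w", "au": "w",
    "ir": "r", "ar": "r",
}

# Each vowel entry is set equal to the index of its substitute consonant,
# so no special case is needed at lookup time.
for vowel, consonant in VOWEL_SUBSTITUTE.items():
    PHONEME_INDEX[vowel] = PHONEME_INDEX[consonant]


def context_index(p1, p2):
    """CVC index for the phonemes preceding (p1) and following (p2) a vowel."""
    return PHONEME_INDEX[p2] + 34 * PHONEME_INDEX[p1]
```

With this layout a context such as t-V-ej yields the same index as t-V-j, which is the effect the substitution table is meant to achieve.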

It is to be noted that the "substitution" represented in Table 5 is used solely for the purpose of generating a CVC
index value to represent the context of the selected vowel phoneme V. The original "outer vowel" is used when
synthesizing the outer vowel.

Thus, at this point, whether the neighboring phonemes are consonants or vowels, we have a CVC index value 
representing the context of a selected vowel phoneme V.

Referring to Figure 7B, the formant parameters for a selected vowel phoneme V are generated as follows. There
are 23 vowel phoneme-to-allophone decoders 120, one for each of the 23 vowel phonemes. As will be described in
more detail, each vowel phoneme-to-allophone decoder 120 stores encoded data representing all of the vowel allophones
for the corresponding vowel phoneme.

Whenever a vowel phoneme is encountered in the string of phonemes that is being synthesized, the data for the
corresponding allophone is generated as follows. First, the CVC index for the context of the vowel phoneme is calculated,
as described above with reference to Figure 7A. Then, the CVC index is sent by a software multiplexer 122 to
the allophone decoder 120 for the corresponding vowel phoneme V.

The selected allophone decoder 120 outputs four code book index values FX1 - FX4, as well as a set of formant 
data values FD which will be described below. The allophone decoder 120 is shown in more detail in Figure 7C. The 
code books 90a - 90d output formant data FDC representing the central portions of the four speech formants for the 
selected vowel allophone. 

The combined outputs FD and FDC are sent to a parameter stream generator 124, which outputs new formant
values to the formant synthesizer 62 (shown in Figure 2) once every 10 milliseconds for the duration of the allophone,
thereby synthesizing the selected allophone. More generally, the parameter stream generator 124 continuously outputs
formant data every 10 milliseconds to the formant synthesizer, with the formant data representing the stream of phonemes
and/or allophones that are selected by earlier portions of the TTS conversion process.

Figure 7C shows one vowel phoneme-to-allophone decoder 120. As explained above, there are 23 such decoders,
one for each of the 23 vowel phonemes in the preferred embodiment. Thus the data stored in the decoder 120 represents
the allophones for one selected vowel phoneme.

The data representing all of the allophones associated with one vowel phoneme V is stored in a table called the
Allophone Data Table 130.

Referring to Figure 8, each Allophone Data Table 130 contains separate records or entries 132 for each of a number
of unique vowel allophones. Each record 132 in the Allophone Data Table 130 contains the set of data listed in Table
3, as described above. In particular, the record 132 for any one allophone contains four code book indices FX1 - FX4,
representing the center portions of the four formants f1 - f4 for the allophone, four values bb1 - bb4 representing the
back boundary values of the four formants, four values fb1 - fb4 representing the forward boundary values of the four
formants, nine bandwidth values b31 - b73 representing the bandwidths of the three lower formants f1 - f3 (as shown
in Figure 3), and a value called LLRR which will be described below.

The data values in the record 132 are scaled using the scaling and compression factors listed in Table 3. As a 
result, each record 132 occupies 19 bytes in the preferred embodiment. 

The Allophone Data Table 130 has two portions: one portion 134 for allophones identified by the PVP context
(i.e., the CVC index value) of the vowel V, and a smaller portion 136 for the allophones identified by the expanded context
LCVC or CVCR of the vowel V, as will be explained in more detail below. The smaller portion 136, called the Extended
Allophone Data Table, contains up to 16 records, each having the same format as the records in the rest of the table
130.

While there are 1156 possible CVC contexts for each vowel phoneme V, the inventors have further reduced memory
requirements by selecting a number of "distinct allophones" which sound sufficiently distinct to require storage. The
number of distinct allophones represented in the preferred embodiment is around 10,000 (less than half the number
of CVC contexts), with the exact number depending on the methodology used to select them. Thus many vowel allophones
are perceptually similar and can be considered to be "duplicate" allophones. It is noted that the selection of
distinct allophones is inherently subjective, since it is based on judgments by human technicians.

Storing formant data for 26,588 allophones would require 505,172 bytes of storage (excluding the storage required 
for the code books 90a - 90d). On the other hand, storing formant data for only the 10,000 or so distinct allophones 
requires about 190,000 bytes of storage - which is a significant savings of memory storage for low cost TTS systems. 
As a result, only the distinct vowel allophones for a selected phoneme V are stored in each Allophone Data Table 130.
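
The storage figures quoted above follow directly from the record size and the context counts, as the following Python check shows. The variable names are illustrative; the numeric inputs (23 vowel phonemes, 34 x 34 contexts, 19 bytes per record, roughly 10,000 distinct allophones) are taken from the text.

```python
VOWEL_PHONEMES = 23
CVC_CONTEXTS = 34 * 34          # 1156 contexts per vowel phoneme
BYTES_PER_RECORD = 19
DISTINCT_ALLOPHONES = 10_000    # approximate figure from the text

all_allophones = VOWEL_PHONEMES * CVC_CONTEXTS
full_bytes = all_allophones * BYTES_PER_RECORD
compact_bytes = DISTINCT_ALLOPHONES * BYTES_PER_RECORD

print(all_allophones, full_bytes, compact_bytes)  # 26588 505172 190000
```

Storing only the distinct allophones therefore needs a little over a third of the space required for a record per context, before the small context tables are added.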

Referring to Figure 7C, the purpose of the Allophone Context Table 140, Duplicate Context Table 144, and LLRR 
Table 148 is to enable the use of a compact Allophone Data Table 130 which stores data only for distinct allophones. 
These additional tables 140, 144 and 148 are used to convert the initial CVC index value into a pointer to the appropriate
record in the Allophone Data Table 130.

Figure 9 shows an Allophone Context Table 140, for one phoneme V. The purpose of the Allophone Context Table
140 is to convert a CVC index value (calculated by the indexing mechanism shown in Figure 7A) into a Context Index CI.

Each of the 23 Allophone Context Tables 140 contains a single Mask Bit, Mask(i), for each of the 1156 CVC contexts
for a vowel phoneme V. Distinct vowel allophones are denoted with a Mask Bit 142 equal to 1, and "duplicate" vowel
allophones which are perceptually similar to one of the other vowel allophones are denoted with a Mask Bit of 0.
Nonexistent allophones (i.e., CVC contexts not used in the English language) are also denoted with a Mask Bit equal
to 0.

To find the CI index value for any particular vowel allophone, the Mask value Mask(CVC Index) is inspected. If the
Mask Bit value is equal to 1, the value of CI is computed as the sum of all the Mask Bits for CVC Index values less
than or equal to the selected CVC Index value:

CI(N) = \sum_{i=0}^{N} Mask(i)

where N is equal to the CVC Index value that is being converted into a CI value.
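
The running-sum conversion from a CVC index to a CI value can be sketched in Python as below. Holding the mask as a list of 0/1 integers is an assumption made for readability; the description only states that one Mask Bit is stored per CVC context.

```python
def context_index_ci(mask, cvc):
    """Return CI for a distinct allophone: the number of Mask Bits equal
    to 1 at CVC index values 0..cvc inclusive."""
    if mask[cvc] != 1:
        raise ValueError("CI is only defined here for a Mask Bit of 1")
    return sum(mask[: cvc + 1])


# Toy mask for five contexts: contexts 1, 2 and 4 are distinct allophones.
toy_mask = [0, 1, 1, 0, 1]
```

Note that CI as defined by the formula starts at 1 for the first distinct allophone, so an implementation using zero-based record offsets would subtract 1 before indexing; that offset convention is an assumption of this note, not something stated in the text.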

The number of unique vowel allophones for the selected vowel phoneme is CIMAX(V), which is also equal to CI
for the largest CVC index with a nonzero Mask Bit. CIMAX(V) is furthermore equal to the number of records 132 in the
main portion 134 of the Allophone Data Table 130. Referring to Figure 8, the number of entries 132 in the Allophone
Data Table 130 is CIMAX(V) + 16, for reasons which will be explained below.

If the selected Mask Bit 142 equals 0, the selected allophone is a "duplicate", and a substitute CVC index value 
is obtained from the Duplicate Context Table 144. The substitute CVC index value is guaranteed to have a Mask Bit 
equal to 1, and is used to compute a new CI index value as described above.

More particularly, to find the CI value for a particular "duplicate" allophone, the synthesizer looks through the 
records 146 of the Duplicate Context Table 144 for the CVC index value of the duplicate allophone. When the CVC 
index value is found, the new CVC value in the same record replaces the original CVC index value, and the CI
computation process is restarted.

As shown in Figure 9, the Duplicate Context Table 144 comprises a list of "old" or original CVC Index Values and
corresponding "new CVC" values, with two bytes being used to represent each CVC value. In other words, the Table
144 comprises a set of four byte records 146, each of which contains a pair of corresponding CVC Index and "new
CVC" values. The only "old" CVC Index values included in the Duplicate Context Table 144 are those for existent
allophones which have a Mask Bit value of 0 in the Allophone Context Table 140. Thus the Duplicate Context Table
144 will typically contain many fewer records 146 than there are Mask Bits 142 with values of zero. In the preferred
embodiment, the number of entries in the Duplicate Context Table 144 varies from 24 to 111, depending on the vowel
phoneme V.

Should the selected CVC value not be found in the Duplicate Context Table, this would mean that a previously
unknown allophone context has been encountered. In this case, the TTS synthesizer synthesizes the allophone using
a standard "default" context for all allophones. In an alternate embodiment, such allophones could be synthesized
using the "synthesis by rule" methodology previously used in the Speech Plus Prose 2000 product (described above with
reference to Figure 1).
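
The handling of duplicate and unknown contexts can be summarized in a short Python sketch. A dictionary stands in for the search through the four byte records 146, and `default_cvc` is a hypothetical name for the standard "default" context; both are assumptions made for brevity rather than details of the preferred embodiment.

```python
def resolve_ci(mask, duplicates, cvc, default_cvc):
    """Return the CI value for any CVC index.

    mask:        list of Mask Bits (0 or 1), one per CVC index.
    duplicates:  maps an "old" CVC index with Mask Bit 0 to a "new CVC"
                 index whose Mask Bit is 1.
    default_cvc: context used when a Mask Bit 0 index has no duplicate
                 record (assumed to have a Mask Bit of 1).
    """
    if mask[cvc] == 0:
        # Duplicate or previously unknown context: substitute a new index.
        cvc = duplicates.get(cvc, default_cvc)
    # CI is the count of Mask Bits equal to 1 up to and including cvc.
    return sum(mask[: cvc + 1])


toy_mask = [1, 0, 1, 0, 1]
toy_duplicates = {1: 2}  # context 1 sounds like context 2
```

In this toy data, context 3 has a Mask Bit of 0 and no duplicate record, so it falls back to the default context.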

In another embodiment of the invention, the Duplicate Context Table 144 stores the CI value for each duplicate
allophone. Since the CI value occupies the same amount of storage space as a replacement CVC value, the alternate
embodiment avoids the computation of CI values for those allophones which are "duplicate" allophones.

In yet another alternate embodiment of the invention, the Allophone Context Table 140 (for one vowel V) comprises
a table of two byte index values CI, with one CI value for each of the 1156 possible CVC index values. By eliminating
the Mask Table 150, the alternate embodiment occupies about 2000 bytes of extra storage per vowel phoneme V, but
reduces the computation time for calculating CI.

Referring to Figure 7C, we now have a CI index value which points to one record in the Allophone Data Table 130.

As mentioned above, the data in each record 132 of the Allophone Data Table 130 includes an entry called LLRR. 
LLRR actually has two components: LLRRx (the low-order four bits) and LLRRd (the high-order four bits). 
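
Splitting the one byte LLRR entry into its two four bit fields is a simple masking and shifting operation, sketched below in Python. The function name is illustrative.

```python
def split_llrr(llrr):
    """Split an LLRR byte into (LLRRx, LLRRd).

    LLRRx is the low-order four bits, LLRRd the high-order four bits.
    """
    if not 0 <= llrr <= 0xFF:
        raise ValueError("LLRR must fit in one byte")
    return llrr & 0x0F, llrr >> 4
```

For example, an LLRR byte of 0xA3 gives LLRRx = 3 and LLRRd = 10.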

LCVC and CVCR Contexts 

In a relatively small number of cases, the selection of the proper vowel allophone depends not just on the immediately
neighboring phonemes, but also on the phoneme just to the left or to the right of these neighboring phonemes.
The "expanded" context of the selected vowel phoneme can be labelled:
LCVC or CVCR.

Thus there are multiple allophones for a small number of CVC contexts. The inventors have found that, for any
one CVC context, there is at most one LCVC or CVCR context which has a distinct enunciation of the vowel allophone
V. As a result, a relatively small LLRR Context Table 148 and a similarly small Extended Allophone Data Table 136
can be used to represent and store the formant data for these allophones.

The LLRRx value in each Allophone Data Table record denotes whether there is more than one allophone for the
selected CVC context, and thus whether the "expanded" LCVC or CVCR context of the allophone must be considered.
If LLRRx is equal to zero, the allophone data specified by the previously calculated value of CI is used. If LLRRx is not
equal to zero, then an additional computation is needed.

Referring to Figure 10, there is an LLRR Context Table 148 for each vowel phoneme V. The Table 148 contains
fifteen entries or records, each of which identifies an "extended" context. More particularly, the Table 148 can denote
up to fifteen Left or Right Phonemes which identify an extended LCVC or CVCR context.

Each LLRR Context Table record has two values: LRI and CC. The value of LLRRx determines which entry in the 
Table 148 is to be used. Note that there is no entry for LLRRx = 0 because a value of zero indicates that the expanded
context need not be considered. 

CC denotes a phoneme value, and LRI is a "left or right" indicator. When LRI is equal to 0, the phoneme to the left
of the CVC context is compared with the phoneme denoted by CC; when LRI is equal to 1, the phoneme to the right
of the CVC context is compared with the CC phoneme. Only if the selected left or right phoneme matches the CC
phoneme is a "new LLRR CI value" calculated.

If the selected left or right phoneme does not match the CC phoneme, then the data pointed to by CI is the data
used to generate the allophone. If there is a match, however, the LLRRd value acts as a pointer to a record in the
extended portion 136 of the Allophone Data Table 130 shown in Figure 8. In effect, the CI value is replaced with a value of

CIMAX(V) + LLRRd

where CIMAX(V) is the number of records in the main portion 134 of the Allophone Data Table 130.
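
The extended-context decision can be sketched in Python as follows. Storing the fifteen LLRR Context Table records in a list where LLRRx = 1 selects the first record is an assumption of this sketch, as are the function and parameter names; the comparison logic and the replacement value CIMAX(V) + LLRRd follow the description above.

```python
def select_record(ci, llrrx, llrrd, llrr_table, left, right, ci_max):
    """Return the record index to use for the selected vowel allophone.

    llrr_table: list of (LRI, CC) pairs for LLRRx = 1..15 (assumed layout).
    left/right: phonemes just outside the CVC context.
    ci_max:     CIMAX(V), the number of records in the main portion.
    """
    if llrrx == 0:
        return ci  # the expanded context need not be considered
    lri, cc = llrr_table[llrrx - 1]
    neighbor = left if lri == 0 else right
    # Only a match on the indicated side switches to the extended portion.
    return ci_max + llrrd if neighbor == cc else ci


# Hypothetical table: entry 1 tests the left phoneme, entry 2 the right one.
toy_table = [(0, "s"), (1, "l")]
```

With `ci_max` set to 40, a matching left phoneme "s" and LLRRd = 3 would redirect the lookup from record 7 to record 43 in this toy example.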

While there are only sixteen possible values of LLRRd in the preferred embodiment, in alternate embodiments a 
full byte could be used to represent LLRRd, allowing for a much larger number of extended context allophones. Note 
that there is not a one to one correspondence between the entries in the LLRR Table 148 and the Extended Allophone
Data Table 136. In fact, there can be several Extended Allophone Data Table entries for a single LLRR Table entry
because one LLRR Table entry can define the context of several allophones.

Allophone Synthesis Method 

Referring once again to Figure 7C, the process for synthesizing a particular vowel phoneme V is as follows. First
a CVC index value is computed by the context index calculator 110. Then, using the allophone decoder 120 for the
selected vowel phoneme V, a CI index value is computed using the Allophone Context Table 140 and Duplicate Context
Table 144. The CI index value points to a record in the Allophone Data Table 130, which contains formant data for the
allophone. However, if the LLRR value in the selected Allophone Data record has a value of LLRRx ≠ 0, and the expanded
context LCVC or CVCR matches the specified value in the LLRR Table 148, a new CI value replaces the old one and
a new record of data in the Allophone Data Table 130 is used.

The data record 132 of the Allophone Data Table 130 pointed to by CI includes four pointers FX1 - FX4 to records
in the four formant code books 90a - 90d. The data record 132 also includes back boundary and forward boundary
values for the four formants, and a sequence of three bandwidth values for each of the first three formants. The formant
parameters representing the four formant frequency trajectories for the vowel allophone include the data values from
the four selected code book records as well as the data values in the selected Allophone Data Table record.

These formant parameters are then processed by a parameter stream generator 124. This generator 124 interpolates
between the selected formant values to compute dynamically changing formant values at 10 millisecond intervals
from the start of the vowel to its end. For each formant, quadratic smoothing is used from the back boundary at the
start of the vowel to the first "target" value retrieved from the code book. Linear smoothing is performed between the
four target values retrieved from the code book, and also between the fourth code book value and the forward boundary
value at the end of the vowel.

Most contexts require smoothing of the formants backward into the preceding consonant in order to assure a
continuous formant track. To do this, interpolation is done from the vowel's back boundary value to a formant value in
the preceding consonant. Consonants for which this is not done are those where a discontinuity is desired in formants
f2, f3 and f4, namely the nasal consonants (m, n and ng) and stop consonants (p, t, k, b, d, g).

For each formant, the bandwidth is linearly smoothed from the last bandwidth value of the preceding phoneme to
the 30 ms bandwidth value b3x, then to the midpoint bandwidth value b5x, then to the 70% value b7x, and then to the
boundary of the next phoneme.
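
The piecewise linear bandwidth smoothing can be sketched in Python as below. The sketch assumes that the preceding phoneme's last bandwidth value applies at the start of the vowel, that the next phoneme's boundary value applies at the end of the vowel, that the vowel is longer than 60 ms so the breakpoints stay in order, and that the third stored value sits 70 percent of the way through the allophone as listed in Table 4. The function and parameter names are hypothetical.

```python
def bandwidth_track(prev_bw, b3, b5, b7, next_bw, duration_ms, frame_ms=10):
    """Bandwidth values at frame_ms intervals across one vowel allophone."""
    # Breakpoints as (time in ms, bandwidth) pairs, in increasing time order.
    points = [
        (0.0, prev_bw),
        (30.0, b3),
        (0.5 * duration_ms, b5),
        (0.7 * duration_ms, b7),
        (float(duration_ms), next_bw),
    ]
    track = []
    for t in range(0, duration_ms + 1, frame_ms):
        # Find the segment containing t and interpolate linearly within it.
        for (t0, v0), (t1, v1) in zip(points, points[1:]):
            if t0 <= t <= t1:
                track.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
                break
    return track


# 100 ms vowel with made-up bandwidth values in Hz.
example = bandwidth_track(60, 90, 100, 80, 50, duration_ms=100)
```

One value is produced per 10 millisecond frame, matching the rate at which the parameter stream generator 124 is described as feeding the synthesizer.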

Alternate Embodiments 

While the present invention has been described with reference to a few specific embodiments, the description is
illustrative of the invention and is not to be construed as limiting the invention.

In particular, it is noted that the data compression methods used in the preferred embodiment are dictated by the 
need to store all the vowel allophone data in a space of 256k bytes or less. If the storage space limits are relaxed, 
because of relaxed cost criteria or reduced memory costs, a number of simplifications of the data structures well known 
to those skilled in the art could be employed. 

For instance, as noted above, the allophone context table 140 and duplicate context table 144 could be combined
and simplified at a cost of around 45k bytes. At a cost of approximately 256k, formant data can be stored for every
CVC context, thereby eliminating the need for the Allophone Context Table 140 and Duplicate Context Table 144
altogether.

In other alternate embodiments, bandwidth values could be stored in code books much as the formant values are
stored in the preferred embodiment. Similarly, code books could be used to store formant parameter vectors that include
the backward and forward formant boundary values (instead of the above described code books, which store vectors
that include only the intermediate formant parameters). These alternate embodiments would increase the amount of
data compression obtained from the use of code books, but would degrade the quality of the synthesized allophones.

It is also noted that each TTS system can store allophone data representative of the pronunciation of a selected
individual, a selected dialect, a selected cartoon character, or a language other than English. The only difference
between these embodiments of the present invention's vowel allophone production system is the allophone data stored
in the system. In still other embodiments in which there is more memory available for allophone storage, multiple sets
of allophone data could be stored so that a single TTS system could generate synthetic speech which mimics several
different persons or dialects.

Finally, it is noted that in an alternate embodiment of the present invention vowel allophones could be stored using
speech parameters that are based on a different representation of human speech than the formant parameters described
above. It is well known to those skilled in the art that there are several alternate methods of representing synthetic
speech using speech parameters other than formant parameters. The most widely used of these other methods is
known as LPC (linear predictive coding) encoded speech.

Referring to Figure 11, in an alternate embodiment of the invention each distinct vowel allophone is represented
by a set of stored LPC encoded data. Note that Figure 11 is the same as Figure 7C, except for the data and code book
tables. The LPC data for each vowel allophone is a set of parameters which can be considered to be a vector. Synthetic
speech is generated from LPC parameters by processing the LPC parameters with a digital signal processor (i.e., a
digital filter network). While the digital signal processors used with LPC parameters are different than the digital signal
processors used with formant parameters, both types of digital signal processors are well known in the prior art and
can be considered to be analogous for the purposes of the present invention.

Since the LPC parameters for each vowel allophone form a vector, the amount of storage required to represent these
vectors can be greatly reduced using the vector quantization scheme described above. In particular, the intermediate
portions of the LPC vectors for all the vowel allophones can be processed by a minimax distortion vector quantization
process, as described above, to produce the best set of N vectors (e.g., 4000 LPC vectors) for representing the intermediate
portions of the LPC vectors. The resulting N vectors would be stored in a single parameter code book 152.
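
To give a feel for what a minimax code book looks like, the following Python sketch builds one with a greedy farthest-point heuristic. This heuristic is a stand-in chosen for brevity and is not the quantization procedure of the preferred embodiment; the function name, the distance function and the one-dimensional toy vectors are all assumptions.

```python
def greedy_minimax_codebook(vectors, size, dist):
    """Pick `size` code book vectors from `vectors`, each time adding the
    vector that is currently worst represented (farthest-point heuristic).

    Returns the code book and the resulting maximum distortion.
    """
    codebook = [vectors[0]]
    # Distance from each vector to its nearest code book vector so far.
    nearest = [dist(v, codebook[0]) for v in vectors]
    while len(codebook) < size:
        worst = max(range(len(vectors)), key=nearest.__getitem__)
        codebook.append(vectors[worst])
        nearest = [min(d, dist(v, vectors[worst]))
                   for d, v in zip(nearest, vectors)]
    return codebook, max(nearest)


toy_vectors = [(0,), (1,), (10,), (11,), (20,)]
toy_dist = lambda a, b: abs(a[0] - b[0])
book, worst_case = greedy_minimax_codebook(toy_vectors, 3, toy_dist)
```

Each stored allophone would then keep only the index of its nearest code book vector, which is the role played by the single index into the parameter code book 152.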

The LPC Allophone Data Table 150 will store forward and back LPC boundary values, bandwidth values, LLRR,
and a single index into the parameter code book 152.

The methodology for selecting vowel allophones and retrieving the data representing a selected vowel allophone
is unchanged from the preferred embodiment, except that now there is only one code book entry that is retrieved
(instead of four). The parameters selected from the Allophone Data Table 150 and the parameter code book 152 are
sent to the parameter stream generator 124 for inclusion in the stream of data sent to the synthesizer's digital signal
processor.

In yet other embodiments of the present invention, other methods of representing vowel allophones with speech 
parameters can be used. Several such alternate methods are known to the prior art, and new parameter representations 
of speech may be developed in the future. 

In all such alternate embodiments, the primary differences from the preferred embodiment would be in the vowel
allophone data stored, and in the apparatus used to convert the vowel allophone data into synthetic speech. The
number of code books used to compress the vowel allophone parameters will vary depending on the nature of parameter
representation being used. Nevertheless, the system architecture shown in Figure 11 can be applied to all of these
embodiments because the basic methodology for selecting vowel allophones and retrieving the data representing a
selected vowel allophone is unchanged.

Claims

1. A text-to-speech synthesis system, comprising:

text conversion means (20, 22, 24) for converting a specified text string into a corresponding string of consonant
and vowel phonemes (25), each said phoneme being selected from a predefined set of phonemes including
a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;

parameter generating means (40) for generating speech parameters corresponding to said string of phonemes
(25); and

speech synthesizing means (42) for generating a speech waveform corresponding to the speech parameters
generated by said parameter generating means;

characterised by:

vowel allophone storage means (90, 130) storing a multiplicity of predefined vowel allophones, each vowel
allophone being represented by a set of speech parameters; said vowel allophones including allophones for
a multiplicity of vowel phonemes;

vowel phoneme to allophone conversion means (120), coupled to said text conversion means (20, 22, 24) and
said vowel allophone storage means, for computing a phoneme context value for each of at least a subset of
said vowel phonemes in said string of phonemes (25), said phoneme context value comprising a function of
the phonemes in said string of phonemes (25) which precede and follow said vowel phoneme, and for then
assigning to said vowel phoneme a selected one of said predefined vowel allophones corresponding to said
computed phoneme context value;

said parameter generating means (40) including means for generating speech parameters for said assigned
vowel allophones.

2. The text-to-speech conversion system of claim 1 , further characterised by: 
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context table means (140) for assigning one of said vowel allophones to every vowel phoneme context LVR, 
where V represents any vowel phoneme selected from said multiplicity of vowel phonemes, L represents any 
consonant phoneme immediately preceding said vowel phoneme V selected from said predefined set of pho- 
nemes, and R represents any consonant phoneme immediately following said vowel phoneme V selected 
from said predefined set of phonemes; said context table means (140) including a distinct entry for every 
phoneme context LVR denoting which of said vowel allophones is assigned to each said phoneme context 
LVR; and 

said vowel phoneme to allophone conversion means (120) including allophone selection means coupled to 
said context table means (140) for selecting one of said multiplicity of vowel allophones for each of at least a 
subset of said vowel phonemes in said string of phonemes (25), said allophone selection means including 
context indexing means 1 1 0 for determining the phonemes in said string which immediately precede and follow 
said vowel phoneme in said string of phonemes, and table lookup means for assigning to said vowel phoneme 
the vowel allophone denoted in said context table means (140) for said vowel phoneme in the context of said 
preceding and following phonemes. 
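
The table lookup recited in claim 2 can be pictured with a minimal sketch. The phoneme symbols, the table contents and the name `assign_allophones` are hypothetical and chosen only for illustration; the claims do not specify any particular encoding.

```python
from typing import Dict, List, Tuple

# Hypothetical vowel inventory; the symbols are illustrative only.
VOWELS = {"AE", "IY", "AA"}

# Context table: one entry per (L, V, R) context, naming an allophone id.
# The entries are made-up sample data, not values from the patent.
CONTEXT_TABLE: Dict[Tuple[str, str, str], int] = {
    ("K", "AE", "T"): 0,
    ("B", "AE", "T"): 1,
    ("S", "IY", "T"): 2,
}


def assign_allophones(phonemes: List[str]) -> List[Tuple[int, int]]:
    """Return (position, allophone id) for each vowel whose LVR context is in the table."""
    assigned = []
    for i in range(1, len(phonemes) - 1):
        vowel = phonemes[i]
        if vowel not in VOWELS:
            continue
        # Context indexing: the phonemes immediately before and after the vowel.
        context = (phonemes[i - 1], vowel, phonemes[i + 1])
        # Table lookup: the allophone denoted for that context, if any.
        if context in CONTEXT_TABLE:
            assigned.append((i, CONTEXT_TABLE[context]))
    return assigned
```

In this sketch a vowel at the start or end of the string, or one whose context is missing from the table, is simply skipped; a real system would need a policy for those cases.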

3. The text-to-speech conversion system of claim 1 or claim 2, further characterised by: 

said vowel allophone storage means (90, 130) including: 

speech storage means for storing the speech parameters for each said vowel allophone; said speech storage 
means including code book means (90) for storing a multiplicity of sets of speech parameters; and 
allophone means (130) for denoting, for each said vowel allophone, one of said multiplicity of sets of speech 
parameters in said code book means (90). 
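
The indirection in claim 3, where each allophone denotes an entry in a shared code book, might be sketched as follows. The parameter values and the names `CODE_BOOK`, `ALLOPHONE_TO_CODE` and `parameters_for` are assumptions made for illustration.

```python
from typing import Dict, List

# Code book: shared sets of speech parameters (made-up sample values).
CODE_BOOK: List[List[float]] = [
    [700.0, 1200.0, 2600.0],
    [300.0, 2300.0, 3000.0],
]

# Allophone table: each allophone id denotes one code book entry.
# Several allophones may point at the same entry.
ALLOPHONE_TO_CODE: Dict[int, int] = {0: 0, 1: 0, 2: 1}


def parameters_for(allophone_id: int) -> List[float]:
    """Resolve an allophone id to its set of speech parameters via the code book."""
    return CODE_BOOK[ALLOPHONE_TO_CODE[allophone_id]]
```

Because allophones 0 and 1 share one entry here, the code book can hold fewer parameter sets than there are allophones.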

4. The text-to-speech conversion system of claim 2, further characterised by: 

said context indexing means (110) including vowel substitution means (112) for use when a vowel phoneme
V1 in said string of phonemes (25) is immediately preceded or followed by a vowel phoneme, said vowel substitution
means (112) including means for selecting an entry in said context table means (140) to use for assigning one of
said vowel allophones to said vowel phoneme V1.

5. The text-to-speech conversion system of claim 2, further characterised by: 

said context indexing means (110) including vowel substitution means (112) for use when a vowel phoneme
V1 in said string of phonemes (25) occurs in a phoneme context CV1V2 or V2V1C, where C is a consonant phoneme
and V2 is a vowel phoneme neighboring said vowel phoneme V1, said vowel substitution means (112) including
means for selecting one of said phoneme contexts LVR which is phonetically equivalent to said phoneme context
CV1V2 or V2V1C; said table lookup means including means for assigning to said vowel phoneme V1 the vowel
allophone denoted in said context table means (140) for said phonetically equivalent phoneme context LVR.
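
One way to picture the vowel substitution of claims 4 and 5 is to replace a neighboring vowel with a consonant, so that the result is an ordinary LVR context. The mapping below is purely hypothetical; the claims only require that the selected context be phonetically equivalent and do not say how the equivalence is determined.

```python
from typing import Tuple

# Hypothetical vowel inventory and substitution map (illustrative only).
VOWELS = {"IY", "UW", "AE"}
VOWEL_TO_CONSONANT = {"IY": "Y", "UW": "W", "AE": "HH"}


def equivalent_context(left: str, vowel: str, right: str) -> Tuple[str, str, str]:
    """Map a context with a vowel neighbor onto an LVR context with consonant neighbors."""
    l = VOWEL_TO_CONSONANT[left] if left in VOWELS else left
    r = VOWEL_TO_CONSONANT[right] if right in VOWELS else right
    return (l, vowel, r)
```

The returned triple can then be used as the key for the context table lookup in the usual way.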

6. The text-to-speech conversion system of any one of claims 1 to 5, characterised in that: 

said speech parameters are formant parameters. 

7. The text-to-speech conversion system of claim 6, characterised in that: 

the number of sets of formant parameters stored in said code book means (90) is much less than the number 
of vowel allophones stored by said vowel allophone storage means (90, 130); the sets of formant parameters 
stored in said code book means (90) being selected from sets of formant parameters representing substantially 
all of said vowel allophones using a minimax distortion vector quantization process. 
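
Claim 7 names a minimax distortion vector quantization process without detailing it. The sketch below uses a greedy farthest-point heuristic as a stand-in: it picks code book entries so that the worst-case distance from any vector to its nearest entry is kept small. This is an assumed substitute technique, not necessarily the process the patent uses, and Euclidean distance is likewise an assumption.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def distortion(a: Vector, b: Vector) -> float:
    """Euclidean distance, used here as the distortion measure (an assumption)."""
    return math.dist(a, b)


def select_code_book(vectors: List[Vector], size: int) -> List[Vector]:
    """Greedily pick `size` vectors, each time adding the one farthest from the current code book."""
    code_book = [vectors[0]]
    while len(code_book) < size:
        farthest = max(
            vectors,
            key=lambda v: min(distortion(v, c) for c in code_book),
        )
        code_book.append(farthest)
    return code_book


def max_distortion(vectors: List[Vector], code_book: List[Vector]) -> float:
    """Worst-case distortion when every vector is replaced by its nearest code book entry."""
    return max(min(distortion(v, c) for c in code_book) for v in vectors)
```

For example, with the four vectors `(0, 0)`, `(1, 0)`, `(10, 0)` and `(11, 0)` and a code book of size 2, the heuristic selects the two end points, and no vector lies farther than 1.0 from its nearest entry.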

8. The text-to-speech conversion system of any one of claims 1 to 6, further characterised by: 

each vowel allophone in said vowel allophone storage means including a set of back and forward boundary 
parameters representative of speech formants at the boundaries of the allophone, and a set of intermediate pa- 
rameters representative of speech formants between the back and forward boundaries of the allophone. 

9. The text-to-speech conversion system of claim 8, further characterised by: 

each said set of intermediate parameters in said code book means (90) representing the intermediate
trajectory of one formant for a vowel allophone; said allophone storage means (90, 130) including means for denoting
at least three of said sets of intermediate formant parameters; whereby said vowel allophones comprise the formant
parameters for at least three formants.

10. The text-to-speech conversion system of any one of claims 1 to 9, further characterised by: 




said vowel allophone storage means (90, 130) including means for storing vowel allophones as pronounced
by a selected individual so that said text-to-speech conversion system produces synthetic speech which mimics 
said selected individual speaking. 

11. The text-to-speech conversion system of any one of claims 1 to 9, further characterised by:

said vowel allophone storage means (90, 130) including means for storing vowel allophones as pronounced
by an individual speaking a selected dialect so that said text-to-speech conversion system produces synthetic 
speech which mimics said selected dialect. 

12. The text-to-speech conversion system of any one of claims 1 to 9, further characterised by:

said vowel allophone storage means including means for storing vowel allophones as pronounced by a
specified cartoon character so that said text-to-speech conversion system produces synthetic speech which mimics
said selected cartoon character.

13. The text-to-speech conversion system of any one of claims 1 to 9, further characterised by:

said vowel allophone storage means (90, 130) including means for storing vowel allophones as pronounced
by a plurality of selected individuals so that said text-to-speech conversion system produces synthetic speech 
which mimics a plurality of selected individuals. 

14. A method of converting text strings into synthetic speech, the steps comprising:

defining a set of phonemes, including a multiplicity of consonant phonemes and a multiplicity of vowel pho- 
nemes; 

converting a specified text string into a corresponding string of phonemes (25), said string of phonemes in- 
cluding consonant and vowel phonemes, each said phoneme being selected from said defined set of pho- 
nemes; and 

converting said string of phonemes (25) into speech parameters and then generating an audio waveform 
corresponding to said speech parameters; 

characterised by:

storing a multiplicity of predefined vowel allophones, each vowel allophone being represented by a set of 
speech parameters; 

for each of at least a subset of said vowel phonemes in said string of phonemes (25), computing a phoneme
context value for said vowel phoneme as a function of the phonemes in said string of phonemes which precede
and follow said vowel phoneme, and then assigning to said vowel phoneme a selected one of said predefined
vowel allophones corresponding to said computed phoneme context value; and

said converting step including converting said assigned vowel allophones into speech parameters which are 
then used to generate an audio waveform corresponding to said speech parameters. 

15. The method of claim 14, further characterised by: 

storing a multiplicity of predefined vowel allophones, each vowel allophone being represented by a set of 
speech parameters; denoting in a data structure an assigned one of said vowel allophones for every phoneme
context LVR, where V represents any vowel phoneme selected from at least a subset of said multiplicity of
vowel phonemes, L represents any consonant phoneme immediately preceding said vowel phoneme V
selected from said predefined set of phonemes, and R represents any consonant phoneme immediately following
said vowel phoneme V selected from said predefined set of phonemes; said data structure containing a distinct
allophone assignment entry for each said phoneme context LVR; and

for each vowel phoneme in at least a subset of said vowel phonemes in said string of phonemes (25),
determining the phonemes in said string which immediately precede and follow said vowel phoneme in said string
of phonemes, and then assigning said vowel phoneme the vowel allophone denoted in said data structure for
said vowel phoneme in the context of said preceding and following phonemes.

16. The method of claim 14, further characterised by:

said storing step including providing code book means (90) for storing a multiplicity of sets of speech param- 
eters, and providing allophone means (130) for denoting, for each said vowel allophone, one of said multiplicity of 
sets of speech parameters in said code book means (90). 

17. The method of claim 16, characterised in that:

the number of sets of speech parameters stored in said code book means (90) is much less than said pre- 
defined multiplicity of vowel allophones; the sets of speech parameters stored in said code book means (90) being 
selected from sets of speech parameters representing substantially all of said vowel allophones using a minimax 
distortion vector quantization process. 

18. The method of any one of claims 14 to 17, further characterised in that: 

said storing step storing vowel allophones as pronounced by a selected individual so that said method pro- 
duces synthetic speech which mimics said selected individual speaking. 



19. The method of any one of claims 14 to 18, characterised in that: 

said speech parameters are formant parameters. 

20. The method of claim 19, further characterised by:

said storing step including providing code book means (90) for storing a multiplicity of sets of formant
parameters, and allophone means (130) for denoting, for each said vowel allophone, one of said multiplicity of sets
of formant parameters in said code book means (90).

21. The method of claim 19, characterised in that: 

the number of sets of formant parameters stored in said code book means (90) is much less than said
predefined multiplicity of vowel allophones; the sets of formant parameters stored in said code book means (90) being
selected from sets of formant parameters representing substantially all of said vowel allophones using a minimax
distortion vector quantization process.

22. The method of claim 18, further characterised by:

said storing step including storing vowel allophones as pronounced by a selected individual so that said 
method produces synthetic speech which mimics said selected individual speaking. 



Patentansprüche

1. System zur Synthese von gesprochener Sprache aus Text mit den folgenden Elementen: 

Textumsetzeinrichtungen (20, 22, 24) zum Umsetzen eines bestimmten Textstrings in einen entsprechenden
String von Konsonanten- und Vokalphonemen (25), wobei das jeweilige Phonem aus einer vorbestimmten
Gruppe von Phonemen ausgewählt wird, die aus vielen Konsonantphonemen und vielen Vokalphonemen
besteht;

einer Parametererzeugungseinrichtung (40) zum Erzeugen von Sprechparametern, die dem Phonemstring
(25) entsprechen; und

einer Sprechsprachesyntheseeinrichtung (42) zum Erzeugen von Sprachwellen, die den von
Parametererzeugungseinrichtung erzeugten Sprechparametern entsprechen;

gekennzeichnet durch: 

Vokalallophonspeichereinrichtungen (90, 130), die eine Vielzahl vorbestimmter Vokalallophone speichern,
wobei jedes Vokalallophon durch einen Satz von Sprechparametern repräsentiert wird; wobei die Vokalallophone
Allophone für eine Vielzahl von Vokalphonemen enthalten;

eine Vokalphonem-Allophon-Umsetzeinrichtung (120), die mit den Textumsetzeinrichtungen (20, 22, 24) und
Vokalallophonspeichereinrichtungen in Verbindung steht, die dazu dient, einen Phonemkontextwert für jedes
aus mindestens einer Untergruppe der Vokalphoneme im Phonemstring (25) zu berechnen, wobei der
Phonemkontextwert eine Funktion der Phoneme in dem Phonemstring (25) aufweist, die dem Vokalphonem
vorangehen und folgen, und dann dazu, dem Vokalphonem ein ausgewähltes der vorbestimmten Vokalallophone
zuzuordnen, das dem berechneten Phonemkontextwert entspricht;

wobei die Parametererstellungseinrichtung (40) eine Einrichtung zum Erzeugen von Sprechparametern für
die zugeteilten Vokalallophone aufweist.

2. System zur Synthese von gesprochener Sprache aus Text nach Anspruch 1, weiter gekennzeichnet

durch eine Kontexttabelleneinrichtung (140), die dazu dient, einem jeden Vokalphonemkontext LVR eines der
Vokalallophone zuzuordnen, wobei V ein aus der Vielzahl von Vokalphonemen ausgewähltes Vokalphonem
repräsentiert, L ein dem Vokalphonem V unmittelbar vorausgehendes Konsonantphonem, das aus der
vorbestimmten Gruppe von Phonemen ausgewählt ist, repräsentiert und R ein dem Vokalphonem V unmittelbar
folgendes Konsonantphonem, das aus der vorbestimmten Gruppe von Phonemen ausgewählt ist,
repräsentiert; wobei die Kontexttabelleneinrichtung (140) für jeden Phonemkontext LVR einen eigenen Eintrag hat, in
dem festgelegt wird, welche der Vokalallophone dem jeweiligen Phonemkontext LVR zugeordnet werden; und
dadurch, daß die Vokalphonem-Allophon-Umsetzeinrichtung (120) eine mit der Kontexttabelleneinrichtung
(140) verbundene Allophonauswahleinrichtung zum Auswählen eines der vielen Vokalallophone für jedes von
mindestens einer Untergruppe der Vokalphoneme in dem Phonemstring (25), wobei die
Allophonauswahleinrichtung eine Kontextindizierungseinrichtung (110) zum Bestimmen derjenigen Phoneme in dem String
aufweist, die dem Vokalphonem in dem Phonemstring unmittelbar vorausgehen bzw. folgen, und eine
Tabellensucheinrichtung, die dazu dient, dem Vokalphonem das in der Kontexttabelleneinrichtung (140) aufgeführte
Vokalallophon für das Vokalphonem im Kontext des vorhergehenden und nachfolgenden Phonems zuzuordnen.

3. System zur Synthese von gesprochener Sprache aus Text nach Anspruch 1 oder 2, weiter dadurch gekennzeichnet,
daß

die Vokalallophonspeichereinrichtungen (90, 130) die folgenden Elemente aufweisen:
eine Sprachspeichereinrichtung zum Speichern der Sprechparameter für das jeweilige Vokalallophon; wobei
die Sprachspeichereinrichtung eine Codebucheinrichtung (90) zum Speichern vieler Sprechparametersätze
aufweist; und

eine Allophoneinrichtung (130) zum Festlegen eines der vielen Sprechparametersätze in der
Codebucheinrichtung (90) für das jeweilige Vokalallophon.

4. System zur Synthese von gesprochener Sprache aus Text nach Anspruch 2, weiter dadurch gekennzeichnet, daß
die Kontextindizierungseinrichtung (110) eine Vokalsubstituierungseinrichtung (112) aufweist, die dann
eingesetzt wird, wenn einem Vokalphonem V1 in dem Phonemstring (25) ein Vokalphonem unmittelbar vorangeht
bzw. folgt, wobei die Vokalsubstituierungseinrichtung (112) eine Einrichtung zum Auswählen eines Eintrags in der
Kontexttabelleneinrichtung (140) aufweist, die dem Vokalphonem V1 eines der Vokalallophone zuteilt.

5. System zur Synthese von gesprochener Sprache aus Text nach Anspruch 2, weiter dadurch gekennzeichnet, daß
die Kontextindizierungseinrichtung (110) eine Vokalsubstituierungseinrichtung (112) aufweist, die dann
eingesetzt wird, wenn ein Vokalphonem V1 in dem Phonemstring (25) in einem Phonemkontext CV1V2 oder V2V1C
vorkommt, wobei C ein Konsonantphonem und V2 ein neben dem Vokalphonem vorkommendes Vokalphonem
ist, wobei die Vokalsubstituierungseinrichtung (112) eine Einrichtung zum Auswählen eines der Phonemkontexte
LVR aufweist, der phonetisch mit dem Phonemkontext CV1V2 oder V2V1C äquivalent ist; wobei die
Tabellensucheinrichtung eine Einrichtung, die dazu dient, dem Vokalphonem V1 das in der Kontexttabelleneinrichtung (140)
aufgeführte Vokalallophon für den phonetisch äquivalenten Phonemkontext LVR zuzuordnen, aufweist.

6. System zur Synthese von gesprochener Sprache aus Text nach einem der Ansprüche 1 bis 5, dadurch
gekennzeichnet, daß

die Sprechparameter Formantenparameter sind.

7. System zur Synthese von gesprochener Sprache aus Text nach Anspruch 6, dadurch gekennzeichnet, daß

die Anzahl in der Codebucheinrichtung (90) gespeicherter Formantenparametersätze weit geringer ist als
die Anzahl in den Vokalallophonspeichereinrichtungen (90, 130) gespeicherter Vokalallophone; wobei die Sätze
in der Codebucheinrichtung (90) gespeicherter Formantenparameter unter Verwendung eines
Minimaxverzerrungsvektorquantisierungsverfahrens aus Sätzen von Formantenparametern ausgewählt werden,
die im wesentlichen alle Vokalallophone repräsentieren.

8. System zur Synthese von gesprochener Sprache aus Text nach einem der Ansprüche 1 bis 6, weiter dadurch
gekennzeichnet, daß

jedes Vokalallophon in den Vokalallophonspeichereinrichtungen einen Satz von Vorder- und
Hinterbegrenzungsparametern aufweist, die Sprachformanten an den Grenzen der Allophone repräsentieren, und einen Satz
von Zwischenparametern, die Sprachformanten zwischen den Vorder- und Hinterbegrenzungen des Allophons
repräsentieren.

9. System zur Synthese von gesprochener Sprache aus Text nach Anspruch 8, weiter dadurch gekennzeichnet, daß

jeder der Zwischenparametersätze in der Codebucheinrichtung (90) den Zwischenverlauf eines Formanten
für ein Vokalallophon repräsentiert; wobei die Allophonspeichereinrichtungen (90, 130) eine Einrichtung zum
Festlegen von mindestens drei der Sätze von Zwischenformantenparametern aufweist; wobei die Vokalallophone die
Formantenparameter für mindestens drei Formanten aufweisen.

10. System zur Synthese von gesprochener Sprache aus Text nach einem der Ansprüche 1 bis 9, weiter dadurch
gekennzeichnet, daß

die Vokalallophonspeichereinrichtungen (90, 130) eine Einrichtung zum Speichern von Vokalallophonen
aufweist, wie diese von einer ausgewählten Einzelperson ausgesprochen werden, so daß das System zur Synthese
von gesprochener Sprache aus Text eine synthetisierte gesprochene Sprache erzeugt, die die Sprechweise der
ausgewählten Einzelperson nachahmt.

11. System zur Synthese von gesprochener Sprache aus Text nach einem der Ansprüche 1 bis 9, weiter dadurch
gekennzeichnet, daß

die Vokalallophonspeichereinrichtung (90, 130) eine Einrichtung zum Speichern von von einer einen
bestimmten Dialekt sprechenden Einzelperson ausgesprochenen Vokalallophonen aufweist, so daß das System zur
Synthese von gesprochener Sprache aus Text eine synthetisierte gesprochene Sprache erzeugt, die den
ausgewählten Dialekt nachahmt.

12. System zur Synthese von gesprochener Sprache aus Text nach einem der Ansprüche 1 bis 9, weiter dadurch
gekennzeichnet, daß

die Vokalallophonspeichereinrichtung eine Einrichtung zum Speichern von von einer bestimmten
Zeichentrickfilmfigur ausgesprochenen Vokalallophonen aufweist, so daß das System zur Synthese von gesprochener
Sprache aus Text eine synthetisierte gesprochene Sprache erzeugt, die die ausgewählte Zeichentrickfilmfigur
nachahmt.

13. System zur Synthese von gesprochener Sprache aus Text nach einem der Ansprüche 1 bis 9, weiter dadurch
gekennzeichnet, daß

die Vokalallophonspeichereinrichtung (90, 130) eine Einrichtung zum Speichern von von mehreren
Einzelpersonen ausgesprochenen Vokalallophonen aufweist, so daß das System zur Synthese von gesprochener Sprache
aus Text eine synthetisierte gesprochene Sprache erzeugt, die mehrere ausgewählte Einzelpersonen nachahmt.

14. Verfahren zum Umsetzen von Textstrings in synthetisierte gesprochene Sprache, mit den folgenden Schritten:

Definieren eines Satzes von Phonemen, mit einer Vielzahl von Konsonantenphonemen und einer Vielzahl von
Vokalphonemen;

Umsetzen eines vorbestimmten Textstrings in einen entsprechenden Phonemstring (25), wobei der
Phonemstring Konsonanten- und Vokalphoneme aufweist, wobei jedes der Phoneme aus dem definierten Phonemsatz
ausgewählt ist; und

Umsetzen des Phonemstrings (25) in Sprechparameter und dann Erzeugen den Sprechparametern
entsprechender Audiowellen;

gekennzeichnet durch:

Speichern einer Vielzahl vorbestimmter Vokalallophone, wobei jedes Vokalallophon durch einen Satz von
Sprechparametern repräsentiert wird;

Berechnen eines Phonemkontextwerts für das Vokalphonem in Abhängigkeit von denjenigen Phonemen im
Phonemstring, die dem Vokalphonem vorausgehen und folgen, für jeden aus mindestens einer Untergruppe
von Vokalphonemen im Phonemstring (25), und den Schritt, in dem dann dem Vokalphonem ein ausgewähltes
der vorbestimmten Vokalallophone, die dem errechneten Phonemkontextwert entsprechen, zugeordnet wird;
und

wobei der Umsetzschritt beinhaltet, daß die zugeordneten Vokalallophone in Sprechparameter umgesetzt
werden, die dann zum Erzeugen von den Sprechparametern entsprechenden Audiowellen verwendet werden.

15. Verfahren nach Anspruch 14, weiter gekennzeichnet durch:

Speichern einer Vielzahl vorbestimmter Vokalallophone, wobei jedes Vokalallophon durch einen Satz von
Sprechparametern repräsentiert wird; Festlegen eines zugeordneten der Vokalallophone in einer Datenstruktur
für jeden Phonemkontext LVR, wobei V ein aus mindestens einer Untergruppe der Vielzahl von
Vokalphonemen ausgewähltes Vokalphonem repräsentiert, L ein aus der vorbestimmten Gruppe von Phonemen
ausgewähltes, dem Vokalphonem V unmittelbar vorausgehendes Konsonantphonem repräsentiert, und R ein aus
der vorbestimmten Gruppe von Phonemen ausgewähltes, dem Vokalphonem V unmittelbar folgendes
Konsonantphonem repräsentiert; wobei die Datenstruktur einen eigenen Allophonzuordnungseintrag für jeden der
Phonemkontexte LVR enthält; und

Bestimmen der Phoneme in dem String, die dem Vokalphonem in dem Phonemstring unmittelbar vorausgehen
und folgen, für jedes Vokalphonem in mindestens einer Untergruppe der Vokalphoneme in dem Phonemstring
(25), und den Schritt, in dem dann dem Vokalphonem das Vokalallophon, das in der Datenstruktur für das
Vokalphonem im Kontext der vorangehenden und folgenden Phoneme festgelegt ist, zugeordnet wird.

16. Verfahren nach Anspruch 14, weiter dadurch gekennzeichnet, daß

der Speicherschritt das Vorsehen einer Codebucheinrichtung (90) zum Speichern einer Vielzahl von
Sprechparametersätzen und Vorsehen einer Allophoneinrichtung (130) zum Festlegen eines der Vielzahl von
Sprechparametersätzen in der Codebucheinrichtung (90) für jedes der Vokalallophone beinhaltet.

17. Verfahren nach Anspruch 16, dadurch gekennzeichnet, daß

die Anzahl in der Codebucheinrichtung (90) gespeicherter Sprechparametersätze viel geringer ist als die
vorbestimmte Vielzahl von Vokalallophonen; wobei die in der Codebucheinrichtung (90) gespeicherten
Sprechparametersätze unter Verwendung eines Minimaxverzerrungsvektorquantisierungsverfahrens aus im wesentlichen
alle Vokalallophone repräsentierenden Sprechparametersätzen ausgewählt sind.

18. Verfahren nach einem der Ansprüche 14 bis 17, weiter dadurch gekennzeichnet, daß

beim Speicherschritt Vokalallophone so gespeichert werden, wie sie von einer ausgewählten Einzelperson
ausgesprochen werden, so daß das Verfahren synthetisierte gesprochene Sprache erzeugt, die die Sprechweise
der ausgewählten Einzelperson nachahmt.

19. Verfahren nach einem der Ansprüche 14 bis 18, dadurch gekennzeichnet, daß

die Sprechparameter Formantenparameter sind.

20. Verfahren nach Anspruch 19, weiter dadurch gekennzeichnet, daß

der Speicherschritt das Vorsehen einer Codebucheinrichtung (90) zum Speichern einer Vielzahl von
Formantenparametersätzen und einer Allophoneinrichtung (130) zum Festlegen für jedes der Vokalallophone einen
der Vielzahl der Formantenparametersätze in der Codebucheinrichtung (90) beinhaltet.

21. Verfahren nach Anspruch 19, dadurch gekennzeichnet, daß

die Anzahl der in der Codebucheinrichtung (90) gespeicherten Formantenparametersätze wesentlich
geringer ist als die vorbestimmte Vielzahl von Vokalallophonen; wobei die in der Codebucheinrichtung (90)
gespeicherten Formantenparametersätze unter Verwendung eines Minimaxverzerrungsvektorquantisierungsverfahrens aus
im wesentlichen alle Vokalallophone repräsentierenden Formantenparametersätzen ausgewählt sind.

22. Verfahren nach Anspruch 18, weiter dadurch gekennzeichnet, daß

beim Speicherschritt Vokalallophone so gespeichert werden, wie sie von einer ausgewählten Einzelperson
ausgesprochen werden, so daß das Verfahren synthetisierte gesprochene Sprache erzeugt, die die Sprechweise
der ausgewählten Einzelperson nachahmt.



Revendications

1. Système de synthèse convertissant du texte en paroles, comprenant :

un dispositif (20, 22, 24) de conversion de texte destiné à transformer une chaîne spécifiée de texte en une
chaîne correspondante de phonèmes (25) de consonnes et de voyelles, chaque phonème étant choisi dans
un ensemble prédéfini de phonèmes comprenant plusieurs phonèmes de consonnes et plusieurs phonèmes
de voyelles,

un dispositif générateur de paramètres (40) destiné à créer des paramètres de parole qui correspondent à la
chaîne de phonèmes (25), et

un dispositif (42) de synthèse de paroles destiné à créer une forme d'onde de parole correspondant aux
paramètres de parole créés par le dispositif générateur de paramètres,

caractérisé par

un dispositif (90, 130) de mémorisation d'allophones de voyelles mémorisant de multiples allophones
prédéfinis de voyelles, chaque allophone de voyelle étant représenté par un ensemble de paramètres de parole,
les allophones de voyelles contenant des allophones pour de multiples phonèmes de voyelles,
un dispositif (120) de conversion de phonèmes de voyelles en allophones, couplé au dispositif de conversion
de texte (20, 22, 24), et au dispositif de mémorisation d'allophones de voyelles pour le calcul d'une valeur de
contexte de phonème pour chaque phonème d'au moins un sous-ensemble des phonèmes de voyelles de la
chaîne de phonèmes (25), la valeur du contexte de phonème comprenant une fonction des phonèmes de la
chaîne de phonèmes (25) qui précèdent et qui suivent le phonème de voyelle, et destiné à affecter au phonème de
voyelle un allophone choisi parmi les allophones prédéfinis de voyelles correspondant à la valeur calculée du
contexte de phonème,

le dispositif générateur de paramètres (40) comprenant un dispositif destiné à créer les paramètres de parole
pour les allophones de voyelles qui sont affectés.

2. Système de conversion de texte en paroles selon la revendication 1, caractérisé en outre par

un dispositif (140) à table de contexte destiné à affecter l'un des allophones de voyelles à chaque contexte
de phonème de voyelle LVR, V représentant un phonème quelconque de voyelle choisi parmi les multiples
phonèmes de voyelles, L représentant un phonème quelconque de consonne précédant immédiatement le
phonème de voyelle V choisi parmi l'ensemble prédéfini de phonèmes, et R représentant un phonème de
consonne suivant immédiatement le phonème de voyelle V choisi parmi l'ensemble prédéfini de phonèmes,
le dispositif (140) à table de contexte comprenant une entrée distincte pour chaque contexte de phonème LVR
indiquant quel allophone de voyelle est affecté à chaque contexte de phonème LVR, et
le dispositif (120) de conversion de phonème de voyelle en allophone comprenant un dispositif de sélection
d'allophone couplé au dispositif (140) à table de contexte pour la sélection de l'un des allophones parmi les
multiples allophones de voyelles pour chaque phonème d'au moins un sous-ensemble de phonèmes de
voyelles de la chaîne de phonèmes (25), le dispositif de sélection d'allophones comprenant un dispositif (110)
d'indexation de contexte destiné à déterminer les phonèmes de la chaîne qui précèdent et suivent immédiatement
le phonème de voyelle de la chaîne de phonèmes, et un dispositif de consultation de table destiné à affecter
au phonème de voyelle l'allophone de voyelle désigné dans le dispositif à table de contexte (140) pour le
phonème de voyelle du contexte des phonèmes précédent et suivant.

3. Système de conversion de texte en paroles selon la revendication 1 ou 2, caractérisé en outre en ce que

le dispositif (90, 130) de mémorisation d'allophones de voyelles comprend :

un dispositif de mémorisation de paroles destiné à mémoriser des paramètres de parole pour chaque
allophone de voyelle, le dispositif de mémorisation de paroles comprenant un dispositif (90) à livre de code destiné
à mémoriser de multiples ensembles de paramètres de parole, et

un dispositif à allophones (130) destiné à désigner, pour chacun des allophones de voyelles, un ensemble
parmi les multiples ensembles de paramètres de parole du dispositif (90) à livre de code.

4. Système de conversion de texte en paroles selon la revendication 2, caractérisé en outre par un dispositif (110)
d'indexation de contexte qui comprend un dispositif (112) de substitution de voyelle destiné à être utilisé lorsqu'un
phonème de voyelle V1 de la chaîne de phonèmes (25) est immédiatement précédé ou suivi d'un phonème de
voyelle, le dispositif (112) de substitution de voyelle comprenant un dispositif destiné à sélectionner une entrée
du dispositif (140) à table de contexte destinée à être utilisée pour l'affectation de l'un des allophones de voyelles
au phonème de voyelle V1.

5. Dispositif de conversion de texte en paroles selon la revendication 2, caractérisé en outre en ce que le dispositif
(110) d'indexation de contexte comporte un dispositif (112) de substitution de voyelle destiné à être utilisé lorsqu'un
phonème de voyelle V1 de la chaîne de phonèmes (25) apparaît dans un contexte de phonèmes CV1V2 ou V2V1C,
C étant un phonème de consonne et V2 étant un phonème de voyelle voisin du phonème de voyelle V1, le dispositif
(112) de substitution de voyelle comprenant un dispositif de sélection de l'un des contextes de phonèmes LVR qui
est équivalent phonétiquement au contexte de phonèmes CV1V2 ou V2V1C, le dispositif de consultation de table
comportant un dispositif d'affectation au phonème de voyelle V1 de l'allophone de voyelle désigné dans le dispositif
(140) à table de contexte pour le contexte de phonèmes équivalent phonétiquement LVR.

6. Système de conversion de texte en paroles selon l'une quelconque des revendications 1 à 5, caractérisé en ce
que les paramètres de parole sont des paramètres de formant.

7. Système de conversion de texte en paroles selon la revendication 6, caractérisé en ce que le nombre d'ensembles
de paramètres de formant mémorisés dans le dispositif (90) à livre de code est bien inférieur au nombre
d'allophones de voyelles mémorisé par le dispositif (90, 130) de mémorisation d'allophones de voyelles, les ensembles
de paramètres de formant mémorisés dans le dispositif (90) à livre de code étant sélectionnés parmi les ensembles
de paramètres de formant représentant pratiquement la totalité des allophones de voyelles à l'aide d'une opération
de numérisation du vecteur de distorsion minimale-maximale.

8. Système de conversion de texte en paroles selon l'une quelconque des revendications 1 à 6, caractérisé en outre
en ce que chaque allophone de voyelle du dispositif de mémorisation d'allophones de voyelles comprend un
ensemble de paramètres de limites arrière et avant représentatifs des formants de parole aux limites de l'allophone
et un ensemble de paramètres intermédiaires représentatifs des formants de parole entre les limites avant et
arrière de l'allophone.

9. Système de conversion de texte en paroles selon la revendication 8, caractérisé en ce que chaque ensemble de
paramètres intermédiaires du dispositif (90) à livre de code représente la trajectoire intermédiaire d'un formant
destiné à un allophone de voyelle, le dispositif (90, 130) de mémorisation d'allophones comprenant un dispositif
destiné à désigner au moins trois des ensembles des paramètres intermédiaires de formants, si bien que les
allophones de voyelles comprennent les paramètres de formant d'au moins trois formants.

10. Systeme de conversion de texte en paroles selon Tune quelconque des revendications 1 a 9, caracterise en outre 
en ce que le dispositif (90, 1 30) de memorisation d'allophones de voyelles comporte un dispositif de memorisation 
d'allophones de voyelles telles qu'elles sont prononcees par un individu choisi si bien que le systeme de conversion 
de texte en paroles donne des paroles synthetiques qui imitent la facon individuelle choisie de parler. 

11. Systeme de conversion de texte en paroles selon Tune quelconque des revendications 1 a 9, caracterise en outre 
en ce que le dispositif (90, 1 30) de memorisation d'allophones de voyelles comporte un dispositif de memorisation 
d'allophones de voyelles prononcees par un individu parlant un dialecte choisi si bien que le systeme de conversion 
de texte en paroles produit des paroles synthetiques qui imitent le dialecte choisi. 

12. Text-to-speech conversion system according to any one of claims 1 to 9, further characterized in that the 
vowel allophone storing means includes means for storing vowel allophones as spoken by a specified cartoon 
character, so that the text-to-speech conversion system produces synthetic speech that mimics the selected 
cartoon character. 

13. Text-to-speech conversion system according to any one of claims 1 to 9, further characterized in that the 
vowel allophone storing means (90, 130) includes means for storing vowel allophones spoken by a plurality of 
selected individuals, so that the text-to-speech conversion system produces synthetic speech that mimics a 
plurality of selected individuals. 

14. Method of converting text strings into synthetic speech, comprising the following steps: 

defining a set of phonemes, comprising a multiplicity of consonant phonemes and a multiplicity of vowel 
phonemes, 

converting a specified text string into a corresponding string of phonemes (25), the string of phonemes 
comprising consonant and vowel phonemes, each phoneme being selected from the defined set of phonemes, 
and 

converting the string of phonemes (25) into speech parameters, then generating an audio frequency waveform 
corresponding to the speech parameters, 

characterized by 

storing a multiplicity of predefined vowel allophones, each vowel allophone being represented by a set of 
speech parameters, 

for each phoneme in at least a subset of the vowel phonemes in the string of phonemes (25), computing a 
phoneme context value for the vowel phoneme as a function of the phonemes in the string of phonemes that 
precede and follow the vowel phoneme, and assigning to the vowel phoneme an allophone, selected from the 
predefined vowel allophones, that corresponds to the computed phoneme context value, and 

the converting step comprising converting the assigned vowel allophones into the speech parameters that are 
then used to generate an audio frequency waveform corresponding to the speech parameters. 

15. Method according to claim 14, further characterized by: 

storing a multiplicity of predefined vowel allophones, each vowel allophone being represented by a set of 
speech parameters, 

designating, in a data structure, an assigned allophone from among the vowel allophones for each LVR phoneme 
context, V representing any vowel phoneme selected from at least a subset of the multiplicity of vowel 
phonemes, L representing any consonant phoneme immediately preceding the vowel phoneme V and selected from 
the predetermined set of phonemes, and R representing any consonant phoneme immediately following the vowel 
phoneme V and selected from the predefined set of phonemes, the data structure containing a distinct 
allophone assignment entry for each LVR phoneme context, and 

for each vowel phoneme in at least a subset of the vowel phonemes in the string of phonemes (25), determining 
the phonemes in the string that immediately precede and follow the vowel phoneme in the string of phonemes, 
then assigning to the vowel phoneme the vowel allophone designated in the data structure for the vowel 
phoneme in the context of the preceding and following phonemes. 
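
The LVR look-up recited in claims 14 and 15 can be pictured with the following minimal sketch. It assumes a dictionary keyed by (L, V, R) triples as the data structure, a hypothetical phoneme inventory, invented allophone labels such as `AA_17`, a `#` symbol standing in for a missing neighbor at the edge of the string, and a fallback that leaves a vowel unchanged when no entry exists. None of these specifics comes from the claims.

```python
from typing import Dict, List, Tuple

# Hypothetical vowel phonemes; the symbols are illustrative only.
VOWELS = {"AA", "IY", "UW"}
EDGE = "#"  # assumed placeholder for a missing neighbor at a string edge

# Data structure with a distinct allophone entry per (L, V, R) context.
# The allophone labels are invented for this example.
LVR_TABLE: Dict[Tuple[str, str, str], str] = {
    ("K", "AA", "T"): "AA_17",
    ("B", "IY", "T"): "IY_03",
    (EDGE, "UW", "Z"): "UW_42",
}


def assign_allophones(phonemes: List[str]) -> List[str]:
    """Replace each vowel phoneme by the allophone designated for its
    immediately preceding (L) and following (R) phonemes."""
    result = []
    for i, phoneme in enumerate(phonemes):
        if phoneme not in VOWELS:
            result.append(phoneme)  # consonants pass through unchanged
            continue
        left = phonemes[i - 1] if i > 0 else EDGE
        right = phonemes[i + 1] if i + 1 < len(phonemes) else EDGE
        # Assumed fallback: keep the plain vowel when no entry exists.
        result.append(LVR_TABLE.get((left, phoneme, right), phoneme))
    return result


assigned = assign_allophones(["K", "AA", "T"])
```

In this sketch the (L, V, R) tuple itself plays the role of the phoneme context value; an integer index computed from the two neighbors would serve equally well.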

16. Method according to claim 14, characterized by a storing step that comprises providing codebook means (90) 
for storing a multiplicity of sets of speech parameters, and providing allophone means (130) for designating, 
for each vowel allophone, one set from among the multiplicity of sets of speech parameters in the codebook 
means (90). 

17. Method according to claim 16, characterized in that the number of sets of speech parameters stored in the 
codebook means (90) is much smaller than the predefined number of vowel allophones, the sets of speech 
parameters stored in the codebook means (90) being selected from sets of speech parameters representing 
virtually all of the vowel allophones by a minimax distortion vector quantization process. 
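
The claim does not state which algorithm performs the minimax distortion vector quantization. As a hedged illustration only, the sketch below uses a greedy farthest-point heuristic, a common approximation for keeping the maximum distortion small; it is not presented as the patented procedure. The function names, the squared-error distortion measure, and the sample vectors are assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def distortion(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance, used here as the assumed distortion."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_codebook(vectors: List[Vector], size: int) -> List[int]:
    """Greedily pick up to `size` vectors as codebook entries so that the
    worst-case distortion to the nearest entry shrinks at every step."""
    if not vectors or size < 1:
        raise ValueError("need at least one vector and a positive size")
    chosen = [0]  # arbitrary starting entry (assumption)
    nearest = [distortion(v, vectors[0]) for v in vectors]
    while len(chosen) < min(size, len(vectors)):
        worst = max(range(len(vectors)), key=nearest.__getitem__)
        if nearest[worst] == 0.0:
            break  # every vector already coincides with an entry
        chosen.append(worst)
        nearest = [min(d, distortion(v, vectors[worst]))
                   for d, v in zip(nearest, vectors)]
    return chosen


def quantize(vectors: List[Vector], chosen: List[int]) -> List[int]:
    """Map every vector to the position of its nearest codebook entry."""
    return [min(range(len(chosen)),
                key=lambda j: distortion(v, vectors[chosen[j]]))
            for v in vectors]


samples = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.2, 0.0), (5.0, 5.0)]
entries = select_codebook(samples, 2)
indices = quantize(samples, entries)
```

Here five parameter vectors are represented by two codebook entries, which mirrors the claim's point that the codebook holds far fewer parameter sets than there are vowel allophones.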

18. Method according to any one of claims 14 to 17, further characterized in that the storing step comprises 
storing vowel allophones spoken by a selected individual, so that the method produces synthetic speech that 
mimics the selected individual's manner of speaking. 

19. Method according to any one of claims 14 to 18, characterized in that the speech parameters are formant 
parameters. 

20. Method according to claim 19, further characterized in that the storing step comprises providing codebook 
means (90) for storing a multiplicity of sets of formant parameters, and allophone means (130) for 
designating, for each vowel allophone, one set from among the multiplicity of sets of formant parameters in 
the codebook means (90). 

21. Method according to claim 19, characterized in that the number of sets of formant parameters stored in the 
codebook means (90) is much smaller than the predefined number of vowel allophones, and the sets of formant 
parameters stored in the codebook means (90) are selected from sets of formant parameters representing 
virtually all of the vowel allophones by means of a minimax distortion vector quantization process. 



22. Method according to claim 18, further characterized in that the storing step comprises storing vowel 
allophones spoken by a selected individual, so that the method generates synthetic speech that mimics the 
selected individual's manner of speaking. 